ABSTRACT


Across defense, homeland security, and law enforcement communities, leaders face the
tension between making quick and making well-informed decisions regarding time-dependent
entities of interest. For example, consider a law enforcement organization (searcher) with a
sizable list of potential terrorists (targets) but far fewer observational assets (sensors). The
searcher’s goal is to follow the target, but resource constraints make continuous coverage
impossible, resulting in intermittent observational attempts. We model target behaviour as
a discrete time Markov chain with the state space being the target’s set of possible locations, 
activities, or attributes. In this setting, we define “following the target” as the searcher, at 
any given time step, correctly identifying and then allocating the sensor to the state which 
has the highest probability of containing the target. In other words, in each time period the 
searcher’s objective is to decide where to send the sensor, attempting to observe the target 
in that time period, resulting in a hit or miss from which the searcher learns the target’s true 
transition behaviour. We develop a Multi-Armed Bandit approach for efficiently following 
the target, where each state takes the place of an arm. Our search policy is five to ten times 
better than existing approaches. 
Executive Summary


Across the defense and homeland security communities, decision makers are faced with the 
tension between making quick but also well-informed decisions regarding issues of 
interest that change over time. 

Consider, for example, a law enforcement organization with a sizable list of potential
terrorists but a limited number of observational assets. We designate these potential terrorists
as targets and the observational assets as sensors. The sensors might be patrol officers, cameras,
or even small drones. Because of the disparity between the number of targets and available
sensors, continuous coverage is impossible, resulting in intermittent observational attempts
(hourly, daily, or even weekly). The goal of the law enforcement organization, the searcher,
is to learn the baseline behaviour pattern for each target as quickly as possible. Once a 
reasonable baseline is established, the searcher would shift to some form of change-point 
detection, in order to detect if the target is planning an attack or just going on vacation. 

In this thesis we examine how to quickly establish a behaviour pattern baseline, providing
a method that consistently outperforms the naive approach in expectation. We model the
target’s behavior as a discrete time Markov chain whose state space is the target’s
location, activity, or any specific attributes that change with time. In the simplest scenario,
the searcher has one sensor and in each time period decides where to send the sensor, 
attempting to find the target, and resulting in a hit or miss from which the searcher learns 
the target’s transition behaviour. In general, the searcher’s decision variables are the sensor’s 
locations (i.e., states of the Markov chain) over time. The searcher’s objective is to allocate 
the sensor dynamically so as to learn the target’s behaviour pattern as quickly as possible. 
Figure 1 depicts a simple four-state example transition kernel that defines the behaviour
pattern for a target. We focus on the House to Cafe transition ($p_{14}$).

We develop a Multi-Armed Bandit approach for efficiently following this target, where each 
state takes the place of an arm. Our search policy is five to ten times better than existing
approaches, as can be seen in Figure 2. This figure corresponds to our method’s performance
on the target defined in Figure 1. The black line is the true probability and the blue line
is our method’s estimate at each time step $n$.






Figure 1. Example Four State Transition Probability Graph. The House to 
Cafe Transition Being Bolded and Corresponding to the State 1 to State 4 
Transition. 


[Plot: averaged estimated $p_{14}$ vs. time; N = 1k steps with 100 replications on a four-state system]



Figure 2. Estimated Four State Transition Probabilities (Focused on the 
House to Cafe or State 1 to 4 Transition) vs. Time (Mean with 95% Confidence
Intervals). Comparison Between Current Naive Methods and our
Single-Miss MLE Approach. Generated in MATLAB. 
CHAPTER 1:
Introduction 


This chapter provides the backdrop for this thesis. It not only defines the problem, why it is 
important, and how it is currently being solved, but it also provides the ground rules. 


1.1 The Problem 

We consider a situation where a searcher attempts to locate and maintain observation of a 
target (e.g., terrorist, pirate ship, aircraft debris, endangered animal, or generically, a target 
of interest). We model the target’s behavior as a discrete time Markov Chain (a layman’s 
explanation of this can be found in Chapter 2, Section 2.1). The state space is the target’s
location, activity, or any specific attributes that change either with time or in some discrete 
or sequential fashion (e.g., bank account activity). In particular, the states can be physical 
locations, radio communication activity, the IP address of the computer used by the target, 
or even the target’s current bank account levels or some specific flagged expenditures. 

In the simplest scenario, the searcher has one sensor to follow the entity of interest. For 
instance, at time t = 7 the target can be in locations (or generically, states) a, b, or c. If
the searcher, at t = 7, allocates the sensor to location c and the target also transitioned to 
that state, then the searcher earns a reward of one. The searcher’s decision variables are 
the sensor’s locations (i.e., states of the Markov chain) each time step, using past sensor 
response information to guide future decisions. 

The objective of the searcher (the entity attempting to follow the target) is to allocate the 
sensor dynamically so as to achieve as close to constant observation (sensor positively 
observing the target at each time step) as possible. If the target’s transition matrix were
known, then the searcher would simply position the sensor on the state with the highest 
probability given the last observed state of the target (which is not necessarily the last time 
period) and the sequence of observed states which failed to reveal the target (the states that 
did not contain the target or the sequence of misses). In this thesis, we relax the assumption 
of a known transition matrix. Hence, the searcher attempts to learn the underlying Markov 
transition matrix of the target, constant observation being the goal. 






1.2 Motivation: Why It Is Important 

As briefly mentioned in Section 1.1, our problem can be applied to a very broad range 
of topics or settings where some form of transition dynamics need to be learned. Here it 
is sufficient to highlight a couple that will serve as surrogates for the rest of the possible 
application areas. Specifically, we will expand upon our primary setting of a terrorist in 
a city as well as possible applications to learning the migratory patterns of endangered 
animals (birds being our choice of example) and identifying potential pirates hiding within 
a fleet of innocent fishing vessels. 

Suppose there is a person of interest (the target) living in a large city and the authorities need 
to determine his/her behavior patterns in order to efficiently (minimum number of assets and 
as quickly as possible) track or maintain observation. However, in a typical scenario there 
exist many such targets, meaning that resource limitations preclude persistent surveillance. 
To efficiently track the target, the searcher employs various sensors depending on where it is 
believed the target is currently located. Another reason for intermittent search is to prevent 
the target from realizing he is being tracked. Authorities might therefore intentionally limit 
the use of sensors following the target, while attempting to periodically regain observation 
of the target. This observation might be having a human asset walk by the front of a cafe, 
glancing inside to see if the target is there or not. We model the target’s behavior as a discrete 
time Markov Chain with the state space comprising all the various physical locations that
the target might visit. These might be static locations such as the cafe mentioned above or 
the terrorist’s house but could also include more dynamic states such as the target being in 
a vehicle or on the subway. This setting will serve as our primary motivation, and thus it is 
important to keep it in mind to provide a framework on which our algorithms will be built. 

1.2.1 Migratory Patterns of Endangered Birds 

Instead of attempting to follow a potential terrorist in a city (the person of interest or target), 
imagine the goal is to learn the migratory patterns of an endangered bird. In this setting we 
would model the state space primarily to cover such things as latitude and longitude (each 
state corresponding to a physical location and a small set of activities). The searcher would 
then apply our approach on this specific flock of birds, learning over time the basic seasonal 
patterns of this bird species. 
1.2.2 A Pirate Hiding in a Fishing Fleet

In this setting, which is much closer to our primary scenario, the searcher is attempting to 
learn the overall behavior pattern of a fishing fleet in order to determine if a pirate is hiding 
in the fleet. In this setting the searcher would replicate our algorithm for each “fishing” 
vessel, updating these algorithms with a drone or Unmanned Aerial Vehicle (UAV) network
or mesh. Each UAV would update all algorithms based on which vessels it can see during 
that time step. 

1.3 Current Solution Methods 

An important question is how this problem is currently being solved. Essentially, if this
thesis did not exist, what would people use instead? While an answer to this question
cannot, by nature, be comprehensive, it is still beneficial to
examine how someone might attempt to learn the behaviour pattern of a terrorist within 
a city without the algorithms developed in this thesis. The naive (not intended here as 
derogatory) approach to following a target is to estimate the transition matrix by observing 
the target’s transitions. For example, if the searcher knows the target is currently in state a, 
the searcher would then look in state b during the next time step. Every time the searcher 
observes or fails to observe the target, more information is gained. After accumulating 
a number of hits and misses (assuming stationary target behaviour), the searcher can use 
the current time step’s estimated transition matrix to decide where to put the sensor next. 
The naive method for generating this estimated transition matrix is to use the ratio of hits 
(observed transitions) to total attempted transition observations as the probability for the 
target to make that specific transition. 
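As a minimal sketch of the naive estimator just described (not code from this thesis), the hit ratio can be tracked per transition. The class name, the method names, and the uniform fallback for transitions that have not yet been attempted are illustrative assumptions.

```python
from collections import defaultdict


class NaiveTransitionEstimator:
    """Hit-ratio estimate of transition probabilities (illustrative sketch)."""

    def __init__(self, states):
        self.states = list(states)
        self.hits = defaultdict(int)      # (from_state, to_state) -> observed transitions
        self.attempts = defaultdict(int)  # (from_state, to_state) -> observation attempts

    def record(self, from_state, searched_state, hit):
        """Log one attempt to observe the from_state -> searched_state transition."""
        self.attempts[(from_state, searched_state)] += 1
        if hit:
            self.hits[(from_state, searched_state)] += 1

    def estimate(self, from_state, to_state):
        """Return hits / attempts, or the uniform guess before any attempt."""
        n = self.attempts[(from_state, to_state)]
        if n == 0:
            return 1.0 / len(self.states)
        return self.hits[(from_state, to_state)] / n

    def best_guess(self, from_state):
        """Pure exploitation: the state with the largest estimated probability."""
        return max(self.states, key=lambda s: self.estimate(from_state, s))
```

Note that a miss only increases the denominator of the searched transition; nothing is learned about where the target actually went.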

Initially of course, the searcher will not have a sequence of hits and misses for use in
estimating the target’s behaviour, so it will have to begin with pure exploration (i.e., assume
uniform probabilities as the first estimate). Later on, after obtaining a sequence of hits and
misses, always placing the sensor in the target’s most likely location (pure exploitation) may
result in poor performance. A more sophisticated approach would include a term to force exploration
of the whole state space (so that learning takes place). This and other approaches are fleshed 
out in Chapter 3: Methodology. We develop an algorithm that includes the sequence of hits 
and misses plus a judiciously chosen inflation term to force exploration that consistently 
learns faster. 
1.4 Our Approach

Our goal is to develop an algorithm that learns faster than the naive approach. Specifically, 
we apply methods from the following fields of research: Applied Probability, Optimization, 
Online Learning, and Statistics. We start with an ideal situation where we not only know 
the transition dynamics (or probabilities) but also have some form of Oracle which provides 
us the actual location or state the target transitioned to if we miss it. We then sequentially 
relax the assumptions for this situation until we get to the point where we do not know the 
transition dynamics at all and do not have an Oracle. This of course is more fully discussed 
in Chapter 3: Methodology. 

Our algorithm leverages not only the ratio of hits over total observation attempts (hits and 
misses) for a single transition probability but also the fact that each miss contains quite a 
lot of information regarding other transitions. This additional information comes from the 
assumption that we designed the state space to be comprehensive. In other words, if the 
target is currently at state a, it must either stay in that state or transition to another state 
within the state space. We assume that the target cannot jump out of the state space (i.e., 
we have defined the state space to be exhaustive). 

Therefore, in the situation just described, if the searcher allocates the sensor to look in
state a again but the target is not there, we can distribute the weight of this miss across
the rest of the states: it counts as a miss for the a to a transition but as a partial hit for
each a to x transition, x being any non-a state. In the naive approach a miss just increases
the denominator of the ratio; in our approach this weighting is calculated by an optimization
problem.

Intuitively, the two-state example is the easiest to see. If our state space is {a, b} and the 
target is currently in state a, then if it does not transition back to a we know implicitly 
that it must have transitioned to b. Therefore, instead of just updating the a to a transition 
probability with a miss, we also update the a to b transition probability with a hit. In the 
two-state setting a miss provides just as much information as a hit. 
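The two-state case can be sketched in a few lines. This covers only the special case in which a miss pins down the target's state exactly; it is not the Single-Miss MLE weighting, and the function name and the count dictionary are hypothetical.

```python
def update_two_state(counts, current, searched, hit):
    """Update transition counts for a two-state chain after one look.

    counts[(i, j)] holds the number of times the i -> j transition is known
    to have occurred. With exactly two states and a perfect sensor, a miss
    in the searched state implies the target is in the other state.
    """
    states = {"a", "b"}
    (other,) = states - {searched}
    landed = searched if hit else other
    counts[(current, landed)] = counts.get((current, landed), 0) + 1
    return landed
```

With more than two states a miss no longer identifies a single destination, which is why the miss has to be spread across the remaining states.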
1.5 Measures of Success

How will we know if we have succeeded? We measure our success by our algorithm’s 
expected regret as compared to the naive approach mentioned above in Section 1.3. We 
define expected regret as the cumulative difference between our estimate of the target’s 
transition probability and the true probability over time. Further, we seek to provide upper 
bounds (i.e., worst case bounds) on the performance of our algorithm, namely on the 
expected regret growth over time. This is explained in more detail in Chapters 4 and 5. 

1.6 Ground Rules: Scope, Limitations, and Assumptions 

Before going any further, it will be helpful to lay out a scope for this thesis, some limitations, 
and a few assumptions we are making in our approach. In this thesis we develop an algorithm 
that utilizes and optimizes over the one-step misses (hence, the name “Single-Miss MLE”)
returned from a sensor which only provides binary responses, as mentioned in Section 1.2. 
We define a one-step miss as the situation where the searcher knows the location of the 
target in the previous time step but missed it in the current time step. If the searcher missed 
it again in the next time step, that would be a two-step miss. 

Additionally, we assume that the state space is comprehensive (i.e., the target cannot “jump” 
out of the state space); we only examine sensors that have neither false positive nor false 
negative rates; we only consider discrete time; we assume that the target’s behavior is 
stationary (i.e., the transition matrix doesn’t change over time); we only consider one target 
and one sensor; and we assume that the sensor is unobserved by the target. Relaxing any 
of these last assumptions or limitations would make very good extensions and we hope to 
explore them in future work. 
CHAPTER 2:

Background and Literature Review 


The purpose of this chapter is to orient the reader to where this thesis research fits into 
the broader discipline of Operations Research as well as provide a common background of 
concepts that we will leverage and build upon. The topics or areas of research that intersect 
with our problem are: Persistent Surveillance, Search and Detection Theory, Markov Chains
from Probability Theory, and Online Learning, specifically the Multi-Armed Bandit (MAB)
approach to Machine Learning. The first two are parallel efforts that we wish to assist by 
attacking our problem from an Online Learning approach using the setting of a Markov 
Chain that will provide a more general range of settings than usually seen. 


2.1 Markov Chains 

In this section we refresh the reader on some of the basics regarding Markov Chain theory. 
This is necessary as we intend to leverage the power and flexibility of the Markovian approach 
enabling us to effectively model a wide range of real-world search and observation problems. 
Much of this material is a summary from the incredibly useful “Probability Models for 
Practitioners” [1] class notes written by Professor Kyle Lin from the Naval Postgraduate 
School (NPS). 

Here we take a moment to define a Markov Chain for the reader. A discrete time Markov 
Chain is a sequence of random variables $X_1, X_2, \ldots$, indexed by time and taking values
in some state space, with the property that future values $X_{t+1}, X_{t+2}, \ldots$ depend only
on the current state $X_t = x$ and are therefore conditionally independent of the past. We call this
independence the Markov property. While this may seem like a rather large assumption 
to make, if needed, we can embed information about the past into the current state thereby 
maintaining this assumption without losing the mathematical power of the memoryless 
property of Markov Chains. 
More formally, a discrete time Markov chain is a stochastic process $(X_t : t = 1, 2, \ldots)$
taking values in a discrete state space $\mathcal{S} = \{1, 2, \ldots, S\}$ that satisfies the
Markov property, meaning that

$$P(X_{t+1} \in A \mid X_1, \ldots, X_t) = P(X_{t+1} \in A \mid X_t),$$

for $A \subset \mathcal{S}$, so that the distribution of $X_t$ depends on the past only through
$X_{t-1}$. Appendix B provides a simple example of a discrete time Markov Chain.
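To make the definition concrete, the following sketch simulates a short path from a four-state chain. The transition matrix, the state labels other than House and Cafe, and the function name are hypothetical and are not taken from this thesis.

```python
import random

# Hypothetical four-state kernel; the labels and probabilities are
# illustrative only and are not taken from Figure 1.
STATES = ["House", "Work", "Gym", "Cafe"]
P = [
    [0.1, 0.5, 0.1, 0.3],  # from House
    [0.4, 0.2, 0.2, 0.2],  # from Work
    [0.5, 0.1, 0.1, 0.3],  # from Gym
    [0.6, 0.2, 0.1, 0.1],  # from Cafe
]


def simulate(P, start, steps, rng):
    """Return a path X_1, ..., X_steps of state indices.

    Each next state is drawn using only the current state's row of P,
    which is exactly the Markov property.
    """
    path = [start]
    for _ in range(steps - 1):
        row = P[path[-1]]
        path.append(rng.choices(range(len(row)), weights=row)[0])
    return path


path = simulate(P, start=0, steps=10, rng=random.Random(0))
print([STATES[i] for i in path])
```

Each row of the matrix sums to one, reflecting the assumption that the target cannot leave the state space.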

2.2 Machine Learning 

In this section, we delve into the topic of Machine Learning, specifically its sub-discipline, 
Online Learning. We give an overview of Online Learning and then delve into some specific 
methods that we use in this thesis including the MAB Problem, Thompson Sampling (TS), 
the classical Maximum Likelihood Estimation (MLE) Point Estimation method, and finally 
the Upper Confidence Bound (UCB) algorithm for the Stochastic MAB problem. We 
include a short discussion of MLE because it is a critical component in our approach to 
estimating a given probability based on a sequence of data. 

2.2.1 Online Learning Overview 

Online Learning or Online Convex Optimization (OCO) is a sub-discipline of Machine
Learning under which it was first defined. Hence, it primarily studies the performance of
learning algorithms. As indicated by the second title, at heart it is optimization within a
dynamic setting rather than the standard deterministic setting and therefore has very broad
applicability. As succinctly stated by Hazan in his Introduction to Online Convex Optimization [2]:


In many practical applications the environment is so complex that it is infeasible 
to lay out a comprehensive theoretical model and use classical algorithmic 
theory and mathematical optimization. It is necessary as well as beneficial to 
take a robust approach: apply an optimization method that learns as one goes 
along, learning from experience as more aspects of the problem are observed. 





Of note, this conceptualization blurs the classic definitions or boundaries of deterministic 
modeling, stochastic modeling, and optimization methodologies [2]. As can be seen from 
the dynamic setting, the algorithm doesn’t have all of the information in the beginning but 
still must act. 

The basic framework, from Hazan [2], is that a player (the searcher in our setting) makes
iterative decisions while online (think making decisions with partial information and
additional information arriving as a stream). The underlying game structure is explained
later in this section. When the player makes each decision, the outcomes (think penalty or
loss) associated with the possible choices are unknown. Once the player commits to a
choice, he/she receives the loss associated with that specific choice. Unlike in
Dynamic Programming, the player does not know in advance the losses associated with a
given decision. While these losses can depend on the player’s choices,
they could also be assigned by an adversary or opponent. The following restrictions must
therefore be imposed to make this framework feasible and complete:

1. The losses associated with the set of choices must be bounded. Otherwise, the adversary
could decrease the scale of losses each iteration such that the player (algorithm)
would never recover from the first loss. Therefore, the losses must reside within a 
bounded region. 

2. The set of decisions facing the player or algorithm must be somehow bounded or 
structured. This ensures that we have some sort of meaningful performance metric 
and prevents the adversary from assigning large losses to each choice made by the 
player (algorithm) indefinitely while separating a set of strategies that have no loss. 

Essentially, this framework can be viewed as a structured and repeated game [2]. The 
following notation helps solidify this framework. The set of decisions is a convex, non-empty,
bounded, and closed set in Euclidean space, $\mathcal{K} \subseteq \mathbb{R}^n$, with the costs
being modeled as bounded convex functions over $\mathcal{K}$ [2]. At iteration $t = 1, \ldots, T$,
$T$ being the total number of game iterations, the player chooses a decision $x_t \in \mathcal{K}$.
After committing to a choice, the associated cost function is revealed:
$f_t \in \mathcal{F} : \mathcal{K} \to \mathbb{R}$. Hazan further defines $\mathcal{F}$ as being the
bounded family of cost functions available to the adversary. The cost that
the player must now pay is $f_t(x_t)$. Our performance metric, taken from game theory, is the
difference between the total cost actually incurred by the player over all iterations and the
total cost of the best fixed decision in hindsight. We define this as the regret [2]. To
formally define regret, consider an OCO algorithm, $\mathcal{A}$, that maps an online player’s game
history to a specific decision in the set of decisions over time [2]. The regret of this player, or
algorithm $\mathcal{A}$, after $T$ iterations is formally defined by Hazan in [2], Equation 1.1, as:


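$$\mathrm{regret}_T(\mathcal{A}) = \mathcal{R}_T(\mathcal{A}) = \sup_{\{f_1, \ldots, f_T\} \subseteq \mathcal{F}} \left\{ \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) \right\} \tag{2.1}$$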


From this definition of regret, we see that it is desirable for the regret to be sub-linear as
a function of $T$, since the average regret per iteration then vanishes as $T$ grows. This
setting or framework for online learning has become very
popular recently primarily due to its powerful modeling capabilities [2]. Specifically, it can 
be used to model such diverse problems as online routing, advert placement and selection, 
and even spam filtering [2]. 
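For intuition, the regret of one played sequence can be computed directly once the losses are known. The sketch below makes two simplifying assumptions that depart from the general definition: the decision set is finite rather than a convex set, and the loss sequence is fixed rather than chosen by a supremum over the adversary's options. The function name and the numbers are hypothetical.

```python
def regret(losses, choices):
    """Regret of a played sequence against the best fixed decision.

    losses[t][x] is the loss of decision x at iteration t (revealed only
    after the player commits); choices[t] is the decision actually played.
    """
    incurred = sum(loss[x] for loss, x in zip(losses, choices))
    decisions = range(len(losses[0]))
    best_fixed = min(sum(loss[x] for loss in losses) for x in decisions)
    return incurred - best_fixed


# Three iterations, two possible decisions, losses bounded in [0, 1].
losses = [[0.2, 0.9], [0.8, 0.1], [0.3, 0.4]]
choices = [1, 0, 0]
print(regret(losses, choices))
```

The comparison is against a single decision held fixed for all iterations, not against the best decision at each individual step.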


2.2.2 The Multi-Armed Bandit (MAB) Problem 

In this sub-section we delve further into the Online Learning framework with a specific, and 
rather popular variant, the Stochastic MAB. The Stochastic MAB is named after the
“Single-Arm Bandit,” a Vegas slot machine, which still serves as one of the best ways to describe
this problem. Of note up front, much of the material from this section is a summary of or 
taken directly from Mahajan and Teneketzis’ Multi-armed Bandit Problems [3] and Agrawal 
and Goyal’s Analysis of Thompson Sampling for the Multi-Armed Bandit Problem [4]. 

Imagine a player entering a casino and purchasing 20 “tokens” with which to play a row 
of “Single-Arm Bandits.” We call this row of slot machines a “Multi-Armed Bandit” with 
each slot machine’s lever or handle being an “Arm” of this MAB. Further, we assume 
that each slot machine or arm has the potential to have a payout or reward (based upon a 
potentially different underlying probability distribution for each arm) and each arm’s reward 
is a realization or sample from that distribution. Since the player does not initially know
which arm will give the best payout (i.e., has the most lucrative reward distribution), the
player must begin by exploring the system, using a few tokens to compare the arms. At this 
point, the player has very rough estimates on the potential rewards from a few arms and
must decide to either exploit the best arm so far or continue to explore the rest of the arms
with the finite tokens. This tension between exploration and exploitation is the heart of the
MAB Family of Problems.

Another way of describing this problem is to imagine a player who, at each time step, has
a single resource to allocate to one of a number of competing projects. Upon allocating
the resource, that project changes while the rest stay
static. Further, the reward or return on investment from that project is different than what
each of the other projects might have returned. Hence, the MAB problems are a class of
sequential resource allocation problems [3]. Further, most MAB algorithms use the player’s
regret as their measure of effectiveness. This regret is essentially the same as that defined
above in the Online Learning section. It is defined as the difference between the expected
reward of the “best” arm that the player could have pulled that time step, had he/she known
all of the distributions, and that of the arm he/she actually pulled. Hence, the regret.

There are of course many variants of this problem or ways to adjust the classic setup to fit 
numerous real-world situations such as: one or multiple resources available for allocation, 
new projects appearing over time, all projects changing each time step, or even an adversary 
who chooses the rewards of each arm. 


We base our formulation of the MAB problem on Agrawal and Goyal [4]. Consider a casino
whose slot machines form the set $\mathcal{K}$, each of these “arms” denoted by $i \in \mathcal{K}$.
At each discrete time step, $n = 1, 2, 3, \ldots$, the player must decide which of the arms in
$\mathcal{K}$ to pull. Each arm $i$ returns a random, real-valued reward with support on [0, 1].
The rewards returned from each arm, immediately after pulling that arm, are independent and
identically distributed as well as independent of the play of the other arms. Therefore, the
player or MAB algorithm must decide which arm to pull, at time $n$, based on the rewards received
(or in other words, the information obtained) up through time $n - 1$. Next, we define $\mu_i$ as
the (unknown) expected or average reward for arm $i$, $i(n)$ as the specific arm played at time
step $n$, and $\mu_{i(n)}$ as the (unknown) expected or average reward for the arm pulled at $n$.


The goal, therefore, is to maximize the total expected reward by time $N$. We denote this
expected total reward by $\mathbb{E}\left[\sum_{n=1}^{N} \mu_{i(n)}\right]$. Since this measure
doesn’t really tell us how well our algorithm performs over time (due to no comparison with how
well we could have done), we instead use the equivalent measure of expected total regret. This
regret, as mentioned above, is the difference between the expected reward of the optimal arm and
that of $i(n)$, the arm we played. To define this regret, $\mathcal{R}$, let
$\mu^* := \max_i \mu_i$, $\Delta_i := \mu^* - \mu_i$ (which is always greater than or equal to zero
by definition), and further, let $k_i(n)$ be the number of times the algorithm has pulled arm $i$ by
time $n$. Finally, we formally define this regret as follows:


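$$\mathbb{E}[\mathcal{R}(N)] = \mathbb{E}\left[\sum_{n=1}^{N} \left(\mu^* - \mu_{i(n)}\right)\right] = \sum_{i} \Delta_i \, \mathbb{E}[k_i(N)] \tag{2.2}$$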
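The second equality in Equation 2.2 follows from grouping the time steps by the arm that was played. As a short sketch of that step, write $\mu^*$ for the largest expected reward, $\Delta_i = \mu^* - \mu_i$ for the gap of arm $i$, and assume $k_i(N)$ counts the pulls of arm $i$ through time step $N$:

```latex
\begin{align*}
\sum_{n=1}^{N} \left( \mu^* - \mu_{i(n)} \right)
  &= \sum_{n=1}^{N} \Delta_{i(n)}
   = \sum_{i} \Delta_i \sum_{n=1}^{N} \mathbb{1}\{ i(n) = i \}
   = \sum_{i} \Delta_i \, k_i(N), \\
\mathbb{E}[\mathcal{R}(N)]
  &= \sum_{i} \Delta_i \, \mathbb{E}[k_i(N)]
  \quad \text{(linearity of expectation; each } \Delta_i \text{ is a constant).}
\end{align*}
```

Since the optimal arm has a gap of zero, only pulls of sub-optimal arms contribute to the regret.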


2.2.3 Thompson Sampling (TS) 

We will now examine the basic TS algorithm for the Bernoulli Bandit problem. This 
section’s material and notation (notation does not follow previous sections) is from Agrawal 
and Goyal’s Analysis of Thompson Sampling for the Multi-Armed Bandit Problem [4]. 

First, we examine a special case of the above Stochastic MAB problem in which the result
from pulling an arm is Bernoulli or binary, i.e.,
hit or miss, success or failure. We model this response as simply 0 or 1. So, for arm $i$,
the probability of success or of getting the reward (reward = 1) is $\mu_i$. This special case is
called the Bernoulli Bandit problem. The TS algorithm for it uses Bayesian priors on the Bernoulli
means, the $\mu_i$’s, for which Agrawal and Goyal propose the Beta family of distributions. They
propose this because the Betas have support on the interval (0,1), are continuous probability
distributions, and also enable a very natural posterior update; in other words, the Beta
distribution is a conjugate prior for the Bernoulli likelihood. The following is a quick summary
of Beta distributions, followed by an exploration of the proposed TS algorithm for the Bernoulli Bandit.

The Beta family of distributions, as mentioned above, are continuous probability
distributions with support on the interval (0,1). Below is a plot of some of these distributions for
various ranges of the parameters. Their Probability Density Function (PDF), Beta($\alpha$, $\beta$),
with parameters $\alpha > 0$ and $\beta > 0$, is given by
$f(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$,
and their mean is given by $\mu_{\text{Beta}} = \frac{\alpha}{\alpha + \beta}$ [4]. The larger
$\alpha$ and $\beta$ are, the lower the variance, which is
$\frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$. This decreased variance can be seen
in Figure 2.1, specifically, the gold distribution as compared to, say, the light blue one.
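A quick numerical check illustrates the variance claim. The closed forms used are the standard mean $\alpha/(\alpha+\beta)$ and variance $\alpha\beta/((\alpha+\beta)^2(\alpha+\beta+1))$ of a Beta distribution; the function names and the three parameter pairs are illustrative and are not the curves plotted in Figure 2.1.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return (a * b) / ((a + b) ** 2 * (a + b + 1))


# Same mean, increasingly concentrated as the parameters grow.
for a, b in [(1, 1), (5, 5), (50, 50)]:
    print(a, b, beta_mean(a, b), round(beta_variance(a, b), 5))
```

All three distributions share the mean 0.5, so the shrinking variance reflects only the growth of the parameters.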
[Plot: probability density functions (pdf) of the Beta family of distributions over the interval 0 to 1]


Figure 2.1. The Family of Beta Distributions 

The basic TS algorithm uses the red distribution, in Figure 2.1, as the prior for each arm.
Specifically, it assumes that arm $i$ has a prior Beta(1,1) on $\mu_i$. This is a natural choice as
it is the uniform distribution on the interval (0, 1) [4]. Of note, the following
notation breaks from that defined in previous sections in order to more closely follow [4].
Next, at time $t$, having observed $S_i(t)$ successes or hits (think, reward of 1) and $F_i(t)$
failures or misses (think, reward of 0) in $k_i(t) = S_i(t) + F_i(t)$ plays of arm $i$, the
algorithm updates the current distribution of $\mu_i$ to Beta($S_i(t) + 1$, $F_i(t) + 1$). Lastly,
the algorithm samples from these posterior distributions for the means of the arms, the
$\mu_i$’s, and plays the arm whose sampled mean is the largest. This method
from Agrawal and Goyal [4] is summarized in Algorithm 1, found in Table 2.1.
Algorithm 1: Basic Thompson Sampling for Bernoulli bandits
For each arm $i = 1, 2, \ldots, N$ set $S_i = 0$, $F_i = 0$
for $t = 1, 2, \ldots$ do
    for each arm $i = 1, \ldots, N$ do
        Sample $\theta_i(t)$ from the Beta($S_i + 1$, $F_i + 1$) distribution.
    end
    Play arm $i(t) := \arg\max_i \theta_i(t)$ and observe reward $r_t$.
    if $r_t = 1$ then
        Set $S_{i(t)} = S_{i(t)} + 1$
    else
        Set $F_{i(t)} = F_{i(t)} + 1$
    end
end

Table 2.1. Algorithm: Basic Thompson Sampling 
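Algorithm 1 can be sketched in a few lines using only the standard library. The simulated arms, their success probabilities, the horizon, and the function name are hypothetical; the true means are used solely to generate rewards and are never shown to the algorithm.

```python
import random


def thompson_sampling(true_means, horizon, rng):
    """Run basic Thompson Sampling on simulated Bernoulli arms.

    true_means are hidden from the algorithm and are used only to
    simulate rewards. Returns per-arm success and failure counts.
    """
    n_arms = len(true_means)
    successes = [0] * n_arms  # S_i
    failures = [0] * n_arms   # F_i
    for _ in range(horizon):
        # Draw one sample per arm from its Beta(S_i + 1, F_i + 1) posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


S, F = thompson_sampling([0.2, 0.5, 0.8], horizon=2000, rng=random.Random(1))
pulls = [s + f for s, f in zip(S, F)]
print(pulls)
```

Because arms are chosen by sampling from the posteriors rather than by taking the largest posterior mean, poorly explored arms still get pulled occasionally, which is how the algorithm balances exploration and exploitation.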

The basic idea behind this algorithm is that at each iteration or time step, the TS algorithm
attempts to pull the arm with the largest probability of returning a reward of 1. The intuition
here is that each reward of 1 increments the $\alpha$ parameter of the associated Beta distribution,
shifting it closer to one, while each reward of 0 increments the $\beta$ parameter of the
associated distribution, shifting it closer to zero. To see this graphically, look at the light
blue and purple distributions in Figure 2.1. Additionally, as the number of samples increases,
the variance of the resultant Beta distribution decreases, as mentioned in the last paragraph.
This can also be seen in Figure 2.1 by comparing the gold distribution with the
light blue one. Further, because it constantly updates the estimated Beta distributions, the
algorithm performs well (specific convergence bounds for this algorithm can be found in
Agrawal and Goyal’s paper [4]). Agrawal and Goyal then extend this algorithm
to the general stochastic MAB setting (more information on this can also be found in
[4]).
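
As an illustration only, Algorithm 1 can be sketched in Python as follows. The function name `thompson_sampling`, the argument `true_means` (the arms' success probabilities, which a real searcher would not know and which are used here only to simulate rewards), and the fixed `horizon` are assumptions introduced for this sketch and are not part of [4].

```python
import random


def thompson_sampling(true_means, horizon, seed=0):
    """Sketch of basic Thompson Sampling for Bernoulli bandits.

    true_means is hypothetical simulation input: the unknown success
    probability of each arm, used only to draw rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms  # S_i: observed rewards of 1 per arm
    failures = [0] * n_arms   # F_i: observed rewards of 0 per arm

    for _ in range(horizon):
        # Sample theta_i from the Beta(S_i + 1, F_i + 1) posterior of each arm.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Play the arm with the largest sampled mean.
        arm = max(range(n_arms), key=lambda i: samples[i])
        # Simulate a Bernoulli reward for the chosen arm.
        reward = 1 if rng.random() < true_means[arm] else 0
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return successes, failures
```

Calling `thompson_sampling([0.2, 0.5, 0.8], horizon=2000)` returns the per-arm hit and miss counts; in such a run the bulk of the plays should concentrate on the last arm, which has the largest mean.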
2.2.4 Maximum Likelihood Estimation (MLE)

Here, we will take a moment to examine a very useful method for estimating a specific 
parameter from a distribution. In our case, we are trying to estimate an unknown transition 
probability from the underlying Markov Chain’s transition probability matrix. The following 
summary and material is based on Jay Devore’s textbook, Probability and Statistics for 
Engineering and the Sciences [5] as well as Erich Lehmann and George Casella’s seminal 
text, the Theory of Point Estimation [6]. 

First introduced by R. A. Fisher, between 1912 and 1922, the method of MLE, as its name
implies, attempts to accurately estimate a distribution’s parameter(s) of interest given only
a finite number of samples from that distribution. Specifically, the likelihood function
provides us with how likely the observed samples are as a function of the possible parameter
values. Then, by maximizing the likelihood function the MLE method returns the parameter
values from which the observed data was most likely generated [5]. The following definition
is taken directly from Devore [5]:


Definition: Maximum Likelihood Estimator 


Let $X_1, X_2, \ldots, X_n$ have a joint probability mass function or PDF of

$$f(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2, \ldots, \theta_m) \tag{2.3}$$

where the parameters $\theta_1, \theta_2, \ldots, \theta_m$ have unknown values. When $x_1, x_2, \ldots, x_n$ are
the observed sample values and (2.3) is regarded as a function of $\theta_1, \theta_2, \ldots, \theta_m$, it is
called the likelihood function. The maximum likelihood estimates $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_m$ are
those values of the $\theta_i$’s that maximize the likelihood function, so that

$$f(x_1, \ldots, x_n \mid \hat{\theta}_1, \ldots, \hat{\theta}_m) \ge f(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_m) \quad \text{for all } \theta_1, \ldots, \theta_m \tag{2.4}$$

When the $X_i$’s are substituted in place of the $x_i$’s, maximum likelihood estimators result.

Table 2.2. Definition: Maximum Likelihood Estimator. Reproduced from 
Devore’s Probability and Statistics for Engineering and the Sciences [5] 
But why use the MLE method instead of some other option? As briefly mentioned by
Devore in [5] and closely examined and proved by Lehmann in [6], the MLE method has
some very useful and important properties that make it a good choice. These properties are:

1. Asymptotic Consistency

Under many conditions, MLEs converge in probability to the true parameter value,
$\theta$. Further, by increasing our sample size, $n$, we can achieve an arbitrary level of
precision [6].

2. Asymptotic Efficiency

This means that as the sample size $n$ increases (tends towards infinity), under certain
conditions the MLE converges to the true parameter value, $\theta$, as fast as the quickest
possible method. In other words, this method converges as quickly as theoretically
possible. It achieves the so-called Cramér-Rao lower bound, which means that no
consistent estimator can converge more quickly [6], [7], [8]. While other consistent
estimators may match an MLE in convergence rate, they are not able to beat it.

3. Asymptotic Normality

Again, as $n$ increases, the MLEs converge in distribution, under certain conditions,
to a Gaussian (normal) distribution whose mean equals the true parameter
value, $\theta$, and whose variance is minimal [6]. According to Devore, this variance is “as small as or
nearly as small as can be achieved by any estimator” [5].
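
To make the consistency property concrete, the following minimal sketch estimates a Bernoulli success probability (whose MLE is the sample mean) from simulated samples of increasing size. The helper names `bernoulli_mle` and `estimation_errors`, and the choice of a Bernoulli model, are assumptions made for illustration; they are not taken from [5] or [6].

```python
import random


def bernoulli_mle(samples):
    """MLE of a Bernoulli success probability: the sample mean."""
    return sum(samples) / len(samples)


def estimation_errors(p_true, sample_sizes, seed=0):
    """Absolute error of the MLE for each (hypothetical) sample size."""
    rng = random.Random(seed)
    errors = {}
    for n in sample_sizes:
        # Draw n simulated Bernoulli(p_true) observations.
        samples = [1 if rng.random() < p_true else 0 for _ in range(n)]
        errors[n] = abs(bernoulli_mle(samples) - p_true)
    return errors
```

For example, `estimation_errors(0.3, [100, 10_000, 100_000])` returns one error per sample size; the error at the largest size is typically much smaller than at the smallest, which is the behavior that consistency describes.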
2.2.5 Upper Confidence Bound (UCB) Strategies

This section covers a critical strategy for developing theoretical upper-bounds on the MAB 
convergence rate or expected regret in a specific setting. This section’s material and notation 
closely follows Bubeck and Cesa-Bianchi’s Regret Analysis of Stochastic and Nonstochastic 
Multi-armed Bandit Problems [9] as well as loosely following Elad Hazan’s Introduction 
to Online Convex Optimization [2]. 

In order to examine this topic, it is necessary to start with some basic definitions from the
Stochastic Bandit problem. From Bubeck and Cesa-Bianchi [9], each arm $i \in \{1, \ldots, K\}$
is tied to an unknown probability distribution $\nu_i$. The player, at each time step $t = 1, 2, \ldots$,
picks an arm $I_t \in \{1, \ldots, K\}$, receiving a reward $X_{I_t,t}$ from the probability distribution $\nu_{I_t}$.
This reward is independent of the past. Further, we denote the mean of arm $i$ by $\mu_i$
and then define the optimal mean and arm below, which are unknown to the player [9]:

$$\mu^* = \max_{i=1,\ldots,K} \mu_i \quad \text{and} \quad i^* \in \operatorname*{argmax}_{i=1,\ldots,K} \mu_i$$

Next, we define pseudo-regret following Bubeck and Cesa-Bianchi [9] as:

$$\bar{R}_n = n\mu^* - \mathbb{E}\sum_{t=1}^{n} \mu_{I_t} \tag{2.5}$$

If $i^*$ were known, then the agent would simply pull that arm in each iteration in order to
minimize the pseudo-regret. Of note, the $\mu_i$’s in Equation 2.5 are the actual means of
those arms’ distributions, which are unknown to the agent.

We let $T_i(s) = \sum_{t=1}^{s} \mathbb{1}_{I_t = i}$ denote the number of times that the player has selected arm $i$
within the first $s$ time steps, and we let $\Delta_i = \mu^* - \mu_i$ denote the sub-optimality parameter
of arm $i$ (the expected regret incurred each time this arm is pulled instead of the optimal arm). The
idea is that pulling an arm with large $\Delta_i$ induces a large pseudo-regret. Of course, the
agent doesn’t know the values of $\Delta_i$, but they exist and are well defined as long as the mean
rewards are finite.
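
Grouping the sum in Equation 2.5 by arm shows that the pseudo-regret equals the sum over arms of the sub-optimality parameter times the expected number of pulls of that arm. The short sketch below evaluates that sum for a given vector of pull counts. The function name `pseudo_regret` is hypothetical, and using realized pull counts in place of their expectations is an assumption made purely for illustration.

```python
def pseudo_regret(means, pull_counts):
    """Sum over arms of (mu* - mu_i) * T_i(n) for the given pull counts.

    means are the true arm means (unknown to the agent in practice);
    pull_counts[i] plays the role of T_i(n).
    """
    best_mean = max(means)
    return sum(
        (best_mean - mu) * count for mu, count in zip(means, pull_counts)
    )
```

With means `[0.2, 0.5, 0.8]` and pull counts `[10, 20, 70]`, the two sub-optimal arms contribute 0.6 × 10 and 0.3 × 20, so the total is 12 up to floating-point rounding. Pulls of the optimal arm contribute nothing.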
Following Bubeck and Cesa-Bianchi in [9], we assume that the distributions of the rewards $X$
are light-tailed and hence there exists a convex function $\psi$ on the real numbers such that,
for all $\lambda \ge 0$,

$$\ln \mathbb{E}\, e^{\lambda (X - \mathbb{E}[X])} \le \psi(\lambda) \quad \text{and} \quad \ln \mathbb{E}\, e^{\lambda (\mathbb{E}[X] - X)} \le \psi(\lambda) \tag{2.6}$$


Following Bubeck and Cesa-Bianchi [9] further, the UCB algorithm can be applied to
any light-tailed distribution meeting the conditions defined in Equation 2.6 by forming
an index for each arm and then pulling the arm with the largest index estimated so far.
Each index has two components. The first is the reward’s sample mean obtained by
pulling arm $i$ for $s$ times, $\hat{\mu}_{i,s} = \frac{1}{s}\sum_{t=1}^{s} X_{i,t}$. The second is an inflation term (this is
the “upper confidence bound” part) selected so that the probability of a suboptimal index
being larger than the optimal index is suitably small. In the following index, $\alpha$ is any
positive constant; the choice of its value is examined in further detail in [9].
Also, from convex analysis we denote by $\psi^*(\varepsilon)$ the Legendre-Fenchel transform of the
function $\psi$: $\psi^*(\varepsilon) = \sup_{\lambda \in \mathbb{R}} (\lambda\varepsilon - \psi(\lambda))$, where for example, if $\psi(x) = e^x$ then
$\psi^*(x) = x\ln(x) - x$, $\forall x > 0$. Our index then, at time $t$, is defined as

$$\hat{\mu}_{i,T_i(t-1)} + (\psi^*)^{-1}\!\left(\frac{\alpha \ln t}{T_i(t-1)}\right)$$

for each arm $i$, where $\psi^*(\cdot)$ is the large deviations rate function corresponding to distribution
$\nu_i$, and $T_i(t-1)$ is the number of pulls of arm $i$ by time $t-1$. With this, one can show that

$$\mathbb{P}\!\left(\hat{\mu}_{i,T_i(t-1)} - (\psi^*)^{-1}\!\left(\frac{\alpha \ln t}{T_i(t-1)}\right) > \mu_i\right) \le t^{-\alpha},$$

which implies that the regret grows similar to $\log t$.


The question remains, though: what is the function $\psi^*(\cdot)$? If $\nu_i$ has bounded support (the
relevant scenario for this thesis) over $(-b, b)$, for $0 < b < \infty$, then, as Bubeck and Cesa-Bianchi
mention in [9], one can use $\psi(\lambda) = \frac{\lambda^2 b^2}{2}$ (by Hoeffding’s lemma). The Gaussian case, important for several
applications, leads to $\psi(\lambda) = \frac{\sigma^2 \lambda^2}{2}$, where $\sigma^2$ is the variance. It is possible to get
suitable bounds for $\psi^*(\cdot)$ under further assumptions, but this lies outside the scope of this
thesis.
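
As a worked illustration, suppose the bound in Equation 2.6 holds with a quadratic function $\psi(\lambda) = \sigma^2\lambda^2/2$ for some constant $\sigma^2 > 0$ (an assumption made for this sketch). The Legendre-Fenchel transform and its inverse can then be computed in closed form, which gives an explicit inflation term:

```latex
\begin{align*}
\psi(\lambda) &= \frac{\sigma^2\lambda^2}{2}
  && \text{(assumed quadratic bound)} \\
\psi^*(\varepsilon) &= \sup_{\lambda \in \mathbb{R}}
  \left( \lambda\varepsilon - \frac{\sigma^2\lambda^2}{2} \right)
  = \frac{\varepsilon^2}{2\sigma^2}
  && \text{(maximizer } \lambda = \varepsilon / \sigma^2 \text{)} \\
(\psi^*)^{-1}(x) &= \sqrt{2\sigma^2 x}, \qquad x \ge 0 \\
\text{index of arm } i &= \hat{\mu}_{i,T_i(t-1)}
  + \sqrt{\frac{2\sigma^2\,\alpha \ln t}{T_i(t-1)}}
\end{align*}
```

For rewards confined to $[0, 1]$, Hoeffding's lemma allows $\sigma^2 = 1/4$, and the inflation term becomes $\sqrt{\alpha \ln t / (2\,T_i(t-1))}$.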
In summary, the algorithm pulls arm

$$I_t \in \operatorname*{argmax}_{i=1,\ldots,K} \left[ \hat{\mu}_{i,T_i(t-1)} + (\psi^*)^{-1}\!\left(\frac{\alpha \ln t}{T_i(t-1)}\right) \right]$$

at time $t$. By using this strategy the algorithm achieves the following regret upper-bound as
examined and defined by Bubeck and Cesa-Bianchi in [9]:


Theorem: Pseudo-regret of the $(\alpha, \psi)$-UCB Strategy

Assume that the reward distributions satisfy Equation 2.6. Then $(\alpha, \psi)$-UCB with
$\alpha > 2$ satisfies

$$\bar{R}_n \le \sum_{i:\, \Delta_i > 0} \left( \frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \ln(n) + \frac{\alpha}{\alpha - 2} \right)$$


Table 2.3. Theorem: UCB Pseudo-Regret. Reproduced from Bubeck
and Cesa-Bianchi’s Regret Analysis of Stochastic and Nonstochastic Multi-armed
Bandit Problems [9]


The key intuition from this theorem is that each sub-optimal arm is sampled on the order of
$\ln n$ times over $n$ time steps, while the optimal arm is sampled on the order of $n - \ln n$ times.
The inflation term therefore ensures that every arm keeps being sampled, preventing
the algorithm from getting “stuck” on a sub-optimal arm that just happened to have a long
run of good rewards. It is well known that eliminating the inflation term from the index
leads to a regret that grows linearly, as opposed to logarithmically.
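
The index strategy can be sketched as follows for the special case of rewards in [0, 1], where Hoeffding's lemma permits $\psi(\lambda) = \lambda^2/8$ and the inflation term becomes $\sqrt{\alpha \ln t / (2\,T_i(t-1))}$. The function name `alpha_ucb`, the simulated Bernoulli rewards driven by `true_means`, and the initial round that pulls each arm once are assumptions of this sketch rather than details taken from [9].

```python
import math
import random


def alpha_ucb(true_means, horizon, alpha=2.1, seed=0):
    """Sketch of an (alpha, psi)-UCB rule for rewards in [0, 1].

    true_means is hypothetical simulation input; the agent never reads
    it directly, it only drives the simulated Bernoulli rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms    # T_i: number of pulls of each arm so far
    totals = [0.0] * n_arms  # cumulative reward of each arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull every arm once so that each index is well defined.
            arm = t - 1
        else:
            # Index = sample mean + inflation term sqrt(alpha ln t / (2 T_i)).
            arm = max(
                range(n_arms),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(alpha * math.log(t) / (2.0 * counts[i])),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward

    return counts
```

The returned pull counts make the theorem's message visible: in a run such as `alpha_ucb([0.2, 0.5, 0.8], 5000)` the sub-optimal arms receive comparatively few pulls, while the arm with the largest mean receives most of them.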
2.3 Related Studies

The following are seven studies that apply similar Machine Learning approaches to intelligence
collection operations and are therefore worth mentioning even though their topics
are focused on slightly different applications.

1. Costica, in [10], examines methods for reducing the congestion generated during the
most time-consuming stages of the intelligence cycle, namely the classification stage.
He proposes a tandem queue based optimization model as an analytic solution to this
problem.

2. Nevo, in [11], builds upon Costica’s work by further studying this intelligence cycle
bottleneck. He formulates the problem as an exploitation-exploration trade-off between
good known intelligence sources and raw or unexplored sources that may or
may not be valuable.

3. Ellis, in [12], develops a software library implementing Nevo’s previously generated
mathematical model of information selection in an Online Learning setting, specifically,
the intelligence cycle mentioned above. Further, he tests the performance of
these different algorithms in a social communications network setting.

4. Tekin, in [13], analyzes applying Online Learning methods to a couple of the early
stages of the intelligence cycle, namely, the collecting and processing stages. He
assumes that the intelligence products arrive sequentially such that Online Learning
algorithms are a realistic approach. Specifically, he develops a modified Thompson
Sampling algorithm to solve for the optimal arm to select given the most recent samples
analyzed.

5. Marshall, in [14], approaches the collection, processing, and analyzing stages of the
intelligence cycle from a MAB Allocation (MABA) framework. Specifically, this
framework models the problem as a “novel finite horizon Bayesian stochastic dynamic
programming problem...” [14]. Further, he utilizes a novel Lagrangian based
index heuristic for source, or arm, selection.

6. Hepworth, in [15], investigates leveraging quantile, or superquantile, risk under a loss
constraint in the context of a MAB setting. Specifically, he applies his algorithms
in an intelligence collection setting where each arm of the MAB corresponds to a
particular item or document which may yield significant or little to no value to the
intelligence analyst. He develops two sequential elimination algorithms which “select
the most important source for a given constraint level, sampling from the arm(s) with
the largest conditional expectation over a quantile” [15].

7. Grant, in [16], attacks a significantly different problem than those listed above but
utilizes similar methods. He focuses on the UAV Search Problem, specifically,
allocating UAVs to various sub-regions or boundaries in order to optimize detection
of events of interest. This problem is, of course, of great interest to the intelligence
community at large. He approaches this problem along three broad avenues: Intensity
Estimation, Optimization, and Machine Learning.
CHAPTER 3:
Methodology 


In this chapter, we examine the basic problem and our approach to its solution. As mentioned
in Chapter 1, we consider a scenario where a searcher attempts to locate and maintain
observation of a target. We model the target’s behavior as a discrete time Markov chain
whose state space is the target’s location, activity, or any specific attributes that
change over time or in some sequential manner. The searcher has only one sensor with
which to observe/follow the entity of interest, receiving a reward of one in each time step
in which it detects the current state of the target. In general, the searcher’s decision variable
is the sensor’s next location (i.e., a state of the Markov chain) over time. The objective
of the searcher is to allocate the sensor dynamically so as to earn the largest expected
total reward over some finite time horizon $N$, with the current discrete time step being
$n \in \mathcal{N} = \{1, 2, \ldots, N\}$. There are four basic settings for this problem, which are listed
below. We delve briefly into the first two since they are the simplest settings but focus the
majority of our effort on the last two as they are the most insightful.

• Known Transition Dynamics with an Oracle 

• Known Transition Dynamics without an Oracle 

• Unknown Transition Dynamics with an Oracle (Naive and Single-Miss methods) 

• Unknown Transition Dynamics without an Oracle (hardest and most interesting) 

For the settings without an oracle, we assume that our Markov chain is
irreducible (i.e., the target will never enter a “sink” state or “sink” subset of states). This
ensures that it is possible for us to regain observation of the target once we lose track of
it. Hence, if we (the searcher) keep looking at, say, state 1, then at some point we will
eventually regain observation of the target. We define our state space as $\mathcal{L} = \{1, 2, \ldots, L\}$.
Further, we use the following notation to indicate states within the state space: $x, i, \ell, j \in \mathcal{L}$.
Of note, Appendix A summarizes and lists the notation used within Chapters 3 and 4.
3.1 Known Transition Dynamics with an Oracle (KTO)

In this first and very basic setting, Known Transition Dynamics with Oracle (KTO), we
assume that after each time step, the searcher is given, by the Oracle, the true movement/location
of the target. For example, if, at $n = 5$, the searcher looks in state $i$ but the
target actually transitioned to state $j$, the searcher is given that information before moving to
the next decision/time step, $n = 6$. This enables us to easily resolve the rewards and choose
where to send the sensor for the next time step. Further, we also assume that the underlying
transition probabilities are known, making this setting primarily an exploitation problem and
removing the standard exploration-exploitation tension. In this setting the searcher always
places the sensor on the mode, i.e., the state with the highest transition probability out of the
target’s current state.


3.2 Known Transition Dynamics with No Oracle (KTNO) 

In this section, since the searcher knows the true transition probabilities, we just condition
on the last known location of the target and the subsequent sequence of misses to determine
the most likely state to which to send or allocate the sensor. In essence, we power up the
sub-matrices of $P$.

More precisely, suppose the target was last seen in location $x_1$ in period 1, and the sensor misses
the target in states $x_2, x_3, \ldots, x_{n-1}$. The question for the searcher is: where to place the
sensor in period $n$ given the last known location and the sequence of misses (i.e., the sample
path $X_1 = x_1, X_2 \ne x_2, \ldots, X_{n-1} \ne x_{n-1}$)? For $x \in \mathcal{L}$ we consider

$$\Pr(X_n = x \mid X_{n-1} \ne x_{n-1}, \ldots, X_2 \ne x_2, X_1 = x_1) = \frac{\Pr(X_n = x, X_{n-1} \ne x_{n-1}, \ldots, X_2 \ne x_2, X_1 = x_1)}{\Pr(X_{n-1} \ne x_{n-1}, \ldots, X_2 \ne x_2, X_1 = x_1)}$$

The goal is to find the most likely state in period $n$, so it suffices to find the maximizer of
the numerator above. That is, we want to maximize

$$\Pr(X_n = x, X_{n-1} \ne x_{n-1}, \ldots, X_2 \ne x_2, X_1 = x_1) = P_{x_1, \bar{x}_2} \cdot P_{\bar{x}_2, \bar{x}_3} \cdots P_{\bar{x}_{n-1}, x}$$

for all $x \in \mathcal{L}$, where $P_{x_1, \bar{x}_2}$ is row $x_1$ of the transition matrix without the element $P_{x_1, x_2}$,
$P_{\bar{x}_2, \bar{x}_3}$ is the sub-matrix formed by removing row $x_2$ and column $x_3$, and $P_{\bar{x}_{n-1}, x}$ is the $x$’th
column of $P$ without the $x_{n-1}$ entry.
As an example, consider a transition probability matrix over the states {1,2,3}, with rows
and columns ordered as states 1, 2, 3:

$$P = \begin{bmatrix} 0.1 & 0.2 & 0.7 \\ 0.5 & 0.2 & 0.3 \\ 0.8 & 0.1 & 0.1 \end{bmatrix}$$

Suppose that $x_1 = 1$. Since the mode of the first row is the third entry, the searcher places
the sensor in state 3 in period 2. If the target is not found there, then $X_2 \ne 3$. Using the
formula above leads to

$$\Pr(X_3 = 1, X_2 \ne 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 \\ 0.5 \end{bmatrix} = 0.11$$

$$\Pr(X_3 = 2, X_2 \ne 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix} = 0.06$$

and

$$\Pr(X_3 = 3, X_2 \ne 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.7 \\ 0.3 \end{bmatrix} = 0.13$$

so the searcher is better off putting the sensor in state 3. As before, if the target is found
then we set $X_3 = 3$; otherwise the searcher selects the state corresponding to the largest of

$$\Pr(X_4 = 1, X_3 \ne 3, X_2 \ne 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 & 0.2 \\ 0.5 & 0.2 \end{bmatrix} \times \begin{bmatrix} 0.1 \\ 0.5 \end{bmatrix} = 0.041$$

$$\Pr(X_4 = 2, X_3 \ne 3, X_2 \ne 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 & 0.2 \\ 0.5 & 0.2 \end{bmatrix} \times \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix} = 0.034$$

and

$$\Pr(X_4 = 3, X_3 \ne 3, X_2 \ne 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 & 0.2 \\ 0.5 & 0.2 \end{bmatrix} \times \begin{bmatrix} 0.7 \\ 0.3 \end{bmatrix} = 0.095$$
Thus, as before, the searcher is better off placing the sensor in state 3 for period 4. The
search proceeds along these lines for as long as the target is not found. As it turns out, it
is optimal for the searcher to forever place the sensor in state 3 as long as the target is not
found, since

$$\Pr(X_n = 3, X_{n-1} \ne 3, \ldots, X_3 \ne 3, X_2 \ne 3, X_1 = 1) = [0.1, 0.2] \times \begin{bmatrix} 0.1 & 0.2 \\ 0.5 & 0.2 \end{bmatrix}^{n-3} \times \begin{bmatrix} 0.7 \\ 0.3 \end{bmatrix}$$

is larger than the corresponding computation for states 1 and 2, for any $n \ge 3$. Since
the matrix [0.1, 0.2; 0.5, 0.2] is sub-stochastic, it can be seen that
$\Pr(X_n \ne 3, X_{n-1} \ne 3, \ldots, X_3 \ne 3, X_2 \ne 3, X_1 = 1) \to 0$, so the target must eventually be found by placing
the sensor in state 3. Indeed, leaving the sensor static in (any) single state guarantees that
the target is found, as long as the Markov chain is irreducible. Once found in state 3, the
searcher is then better off placing the sensor in state 1, and so on.
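
The computation in this example can be reproduced with a short sketch. Zeroing out the probability mass of each state where the sensor missed has the same effect as deleting the corresponding rows and columns of the sub-matrices. The function name `miss_path_scores` and the 0-based state indexing (state 1 of the text is index 0) are assumptions introduced for this illustration.

```python
def miss_path_scores(P, last_seen, misses):
    """Joint probabilities Pr(X_n = x, all misses, X_1 = last_seen) for each x.

    P         : transition matrix as a list of rows (0-based states)
    last_seen : state where the target was last observed (period 1)
    misses    : states the sensor looked at, and missed, in periods 2..n-1
    """
    n_states = len(P)
    # weights[s] = Pr(target is in s now, every look so far was a miss)
    weights = [0.0] * n_states
    weights[last_seen] = 1.0

    for missed_state in misses:
        # Advance the chain by one period.
        stepped = [
            sum(weights[r] * P[r][c] for r in range(n_states))
            for c in range(n_states)
        ]
        # The sensor looked here and saw nothing: remove that mass.
        stepped[missed_state] = 0.0
        weights = stepped

    # One more transition gives the score of every candidate state x.
    return [
        sum(weights[r] * P[r][x] for r in range(n_states))
        for x in range(n_states)
    ]


P = [
    [0.1, 0.2, 0.7],
    [0.5, 0.2, 0.3],
    [0.8, 0.1, 0.1],
]
scores_period_3 = miss_path_scores(P, last_seen=0, misses=[2])
scores_period_4 = miss_path_scores(P, last_seen=0, misses=[2, 2])
```

Up to floating-point rounding, `scores_period_3` is [0.11, 0.06, 0.13] and `scores_period_4` is [0.041, 0.034, 0.095], matching the values above; in both periods the largest score belongs to index 2, that is, state 3.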

These ideas can be extended to the case where the searcher has $w$ sensors, $1 \le w \le L$;
when $w = L$ the target is always found. The idea is to put the sensors in the most likely
states conditioned on the initial known state and the sequence of misses. Specifically, for
a sample path $X_1 = x_1, X_2 \ne x_2, \ldots, X_{n-1} \ne x_{n-1}$, with $x_2, \ldots, x_{n-1} \in \mathcal{L}^w$, the most likely
states in period $n$ are found by finding the $w$ maximizers of

$$\Pr(X_n = x, X_{n-1} \ne x_{n-1}, \ldots, X_2 \ne x_2, X_1 = x_1) = P_{x_1, \bar{x}_2} \cdot P_{\bar{x}_2, \bar{x}_3} \cdots P_{\bar{x}_{n-1}, x}$$

for all $x \in \mathcal{L}$, where $P_{x_1, \bar{x}_2}$ is row $x_1$ of the transition matrix without the elements $P_{x_1, x_2}$,
$P_{\bar{x}_2, \bar{x}_3}$ is the sub-matrix formed by removing rows $x_2$ and columns $x_3$, and $P_{\bar{x}_{n-1}, x}$ is the $x$’th
column of $P$ without the $x_{n-1}$ entries.
3.3 Unknown Transition Dynamics with Oracle (UTO)

In this third basic setting, Unknown Transition Dynamics with Oracle (UTO), we assume 
that after each time step, the searcher is given the true movement/location of the target, 
again from an Oracle, but this time he has to learn or estimate the true underlying transition 
probabilities. For example, if, at $n = 5$, the searcher looks in state $i$ but the target
actually transitioned to state j, the searcher is given that information before moving to the 
next decision/time step, n = 6. This enables a quick resolution of the rewards and easy 
calculation of the estimated transition probabilities for the next time step. We explore this 
setting in two ways. The first is more intuitive and a better method for this setting but is 
fragile or difficult to extend to our final setting. Therefore, we develop the second approach 
with a few different variants. As a final note, the developments in this section apply when 
the searcher has multiple sensors. 


Case 1: Full dependence on Constant Oracle 

We examine a Markov chain with unknown transition probability matrix, i.e., unknown
transition dynamics, but with an Oracle that provides the target’s location at the beginning
of each time step (therefore we are only attempting to predict a single-step transition). In
this case, we do not need to utilize a MAB approach. Instead, thanks to the Oracle, we can
update the elements of the empirical transition matrix at the end of each time step. Hence,
the optimal action for the searcher is to place the sensor in the state that is the empirical
mode of the row corresponding to the last known state.
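
A minimal sketch of this bookkeeping is given below. The class name `EmpiricalTransitions`, its method names, the 0-based state indexing, and the tie-breaking rule (lowest index wins) are all assumptions made for illustration.

```python
class EmpiricalTransitions:
    """Transition counts maintained from an oracle that reports every move."""

    def __init__(self, n_states):
        # counts[i][j] = number of observed transitions from state i to j
        self.counts = [[0] * n_states for _ in range(n_states)]

    def record(self, previous_state, current_state):
        """Store one oracle-reported transition."""
        self.counts[previous_state][current_state] += 1

    def row_estimate(self, state):
        """Empirical transition probabilities out of `state` (None if unseen)."""
        total = sum(self.counts[state])
        if total == 0:
            return None
        return [count / total for count in self.counts[state]]

    def empirical_mode(self, state):
        """State with the most observed transitions out of `state`."""
        row = self.counts[state]
        return max(range(len(row)), key=lambda j: row[j])


tracker = EmpiricalTransitions(n_states=3)
reported_path = [0, 2, 0, 2, 1, 0, 1]  # hypothetical oracle reports
for previous_state, current_state in zip(reported_path, reported_path[1:]):
    tracker.record(previous_state, current_state)
```

After these reports, state 0 has been left three times (twice to state 2 and once to state 1), so `tracker.empirical_mode(0)` points the sensor at state 2 whenever the target was last reported in state 0.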

Case 2: Naively Using the Oracle for Reset 

While the above setting is intuitive and works quite well, what if the searcher is unable
to receive the Oracle’s information every time step and instead gets a periodic update (say
every 6 or 12 time steps)? In the operational setting this could correspond to the target
(terrorist) going home every night or a small fishing boat returning to harbour in the evening.
In both of these cases, if a time step is defined as a single hour, the Oracle would provide its
periodic update roughly every 12 hours. This motivates the following versions of the UTO
setting.
In this approach we estimate the transition probabilities using only the hits’ information.
Specifically, for each state pair $(i, j) \in \mathcal{L}^2$ we compute the ratio of the number of hits in
state $j$ when the target was just seen in state $i$ (the state $i \to j$ transition) up to time $n$, $h^{(n)}_{i,j}$,
over the total number of sensor placements in (or views of) state $j$ when the target was last
in state $i$ up to time $n$, $v^{(n)}_{i,j}$. Specifically, our initial estimator of $p_{i,j} = P(X_{t+1} = j \mid X_t = i)$ is

$$q^{(n)}_{i,j} = \frac{h^{(n)}_{i,j}}{v^{(n)}_{i,j}}, \tag{3.1}$$

when $v^{(n)}_{i,j} > 0$. Of note, $q^{(n)}_{i,j}$ is the $i$’th, $j$’th element of $Q^{(n)}$, the estimated transition
probability matrix at time $n$. Since this estimator does not use miss information, the sum $\sum_j q^{(n)}_{i,j}$
may be strictly smaller than 1. Rescaling each row to sum to 1 produces the following
estimator:

$$\hat{q}^{(n)}_{i,j} = \frac{h^{(n)}_{i,j}}{v^{(n)}_{i,j}} \left( \sum_{\ell \in \mathcal{L}} \frac{h^{(n)}_{i,\ell}}{v^{(n)}_{i,\ell}} \right)^{-1} \tag{3.2}$$


As an example of these ideas imagine that, prior to rescaling, we have the below row of
probability element estimates for a four state Markov chain:

$$\left[ \frac{h^{(n)}_{1,\ell}}{v^{(n)}_{1,\ell}} \right]_{\ell \in \mathcal{L}} = \left[ \frac{1}{28},\ \frac{2}{16},\ \frac{5}{27},\ \frac{12}{41} \right]$$

and recall that in this case the oracle discloses the last position of the target. This means
that we can execute the below updates at each time step (hence, using the oracle only for
reset). Let’s say we choose to allocate the sensor to state 4. If the target transitions from its
current state, 1, to state 2 we would miss it and update the above row of fractions as follows:

$$\left[ \frac{h^{(n+1)}_{1,\ell}}{v^{(n+1)}_{1,\ell}} \right]_{\ell \in \mathcal{L}} = \left[ \frac{1}{28},\ \frac{2}{16},\ \frac{5}{27},\ \frac{12}{42} \right]$$

If, on the other hand, the target transitions from its current state, 1, to state 4 we would
receive a hit and update the above row of fractions as follows:

$$\left[ \frac{h^{(n+1)}_{1,\ell}}{v^{(n+1)}_{1,\ell}} \right]_{\ell \in \mathcal{L}} = \left[ \frac{1}{28},\ \frac{2}{16},\ \frac{5}{27},\ \frac{13}{42} \right]$$
From this last row of fractions we would update the 1st row of our transition probability
estimates with the rescaled probabilities as follows:

$$\left[ \hat{q}^{(n+1)}_{1,\ell} \right]_{\ell \in \mathcal{L}} = \left[ \frac{1}{28},\ \frac{2}{16},\ \frac{5}{27},\ \frac{13}{42} \right] \left( \frac{1}{28} + \frac{2}{16} + \frac{5}{27} + \frac{13}{42} \right)^{-1}$$
If the searcher can cover the entire state space in each iteration with sensors, then there are
no misses, and the estimator defined above is the classical MLE obtained as the solution of

$$\max_{q_{i,1}, \ldots, q_{i,L}} \ \prod_{j} q_{i,j}^{\,h^{(n)}_{i,j}}$$

subject to

$$\sum_{j} h^{(n)}_{i,j} = \text{\# transitions out of state } i \text{ by time } n$$

and

$$\sum_{j} q_{i,j} = 1, \quad \text{with } q_{i,j} \ge 0,$$

for each $i \in \mathcal{L}$, when the number of transitions out of state $i$ is positive, and where $q_{i,j}^{\,h^{(n)}_{i,j}}$
is the probability $q_{i,j}$ raised to the $h^{(n)}_{i,j}$’th power. Of course, when $L = 2$ (i.e., there are just
two states), a miss gives as much information as a hit; the same is true when the searcher
can cover $L - 1$ states with sensors. However, it is not clear how this generalizes to larger
state spaces, even when just two states are not searched in a period (e.g., one sensor and
$L = 3$).


Case 3: Using the Oracle only for Reset 

While maximizing the likelihood of the hits’ observations is good, we aren’t distributing 
the density of the target’s movement to the numerator of our transition counters when the 
sensor returns a miss or zero. Remember the example from Case 2 that only updated the 
denominator for the miss. Intuitively, we know that a miss means the target transitioned to 
one of the other (unobserved) states but we do not update any of their associated fractions. 
This means that we are not using all of the information we could glean from each time step. 
To rectify this situation we maximize the likelihood of both hit and miss observations. 
For ease of reading we drop the time step notation on all variables since this calculation
is executed during a single time step. Since we are only looking at the single-step misses,
the maximum likelihood optimization problem is separable into $L$ optimization problems
(below), one for each row of the transition matrix. The resultant likelihood function,
capturing all of the hits and single-step misses so far, that we maximize is (abusing notation):

$$L(q_{i,1}, \ldots, q_{i,L}) = \prod_{j=1}^{L} q_{i,j}^{\,h_{i,j}} \, (1 - q_{i,j})^{\,v_{i,j} - h_{i,j}}$$

For each $i \in \mathcal{L}$ with $v_{i,j} > 0$, we solve

$$\max_{q_{i,1}, \ldots, q_{i,L}} L(q_{i,1}, \ldots, q_{i,L}),$$

with the following constraints:

$$\sum_{j=1}^{L} q_{i,j} = 1$$

$$q_{i,j} \ge 0, \quad \forall j \in \mathcal{L}$$

Setting $q^*_{i,j} = 0$ is optimal when $h_{i,j} = 0$, so henceforth we assume that $0 < h_{i,j} < v_{i,j}$. The
resultant log-likelihood or objective function is

$$\max_{q_{i,1}, \ldots, q_{i,L}} \log L(q_{i,1}, \ldots, q_{i,L}) = \sum_{j=1}^{L} h_{i,j} \log(q_{i,j}) + (v_{i,j} - h_{i,j}) \log(1 - q_{i,j}).$$

The objective function is a sum of concave functions, and therefore is concave. The constraint
set, being a simplex, is convex. Hence, the Karush-Kuhn-Tucker (KKT) conditions,
as listed by Dimitri Bertsekas in [17] and developed by William Karush in [18] and Harold
Kuhn and Albert Tucker in [19], are sufficient for optimality. These are

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} = k + \lambda_j, \quad \forall j \in \mathcal{L}$$

where $k$ is the Lagrange multiplier for the sum constraint, and $\lambda_j \ge 0$ is the Lagrange
multiplier for the non-negativity constraint, $q_{i,j} \ge 0$. We have $\lambda_j = 0$ if $q_{i,j} > 0$, which holds when
$h_{i,j} > 0$ (true by assumption), so we can disregard these multipliers in the analysis below.
The partial derivatives are monotone decreasing in $q_{i,j}$, with

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} \to \infty \quad \text{as } q_{i,j} \to 0$$

and

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} \to -\infty \quad \text{as } q_{i,j} \to 1.$$

Therefore, all $q_{i,j} > 0$ at optimality since the Markov chain is assumed to be irreducible. Also

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} = 0 \iff q_{i,j} = \frac{h_{i,j}}{v_{i,j}},$$

so that $q_{i,j} < \frac{h_{i,j}}{v_{i,j}}$ if and only if $k > 0$. Summing over all the $q_{i,j}$’s for this row we get

$$k > 0 \iff \sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}} > 1.$$

Based on the above results, there are only three possibilities, dependent on the value of the
sum of our transition counters for this row, $\sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}}$. These three possibilities or cases are
enumerated as follows:

1. $\sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}} = 1$, in which case the optimal solution is $q^*_{i,j} = \frac{h_{i,j}}{v_{i,j}}$.

2. $\sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}} > 1$, so the optimal solution satisfies the root equation below for $k > 0$.
Solving for $q_{i,j}$ we get

$$\frac{h_{i,j}}{q_{i,j}} - \frac{v_{i,j} - h_{i,j}}{1 - q_{i,j}} = k \iff h_{i,j} - v_{i,j}\, q_{i,j} = k\, q_{i,j} - k\, q_{i,j}^2 \iff k\, q_{i,j}^2 - (v_{i,j} + k)\, q_{i,j} + h_{i,j} = 0$$

for all $j \in \mathcal{L}$.
Hence,

$$q_{i,j} = \frac{v_{i,j} + k \pm \sqrt{(v_{i,j} + k)^2 - 4 k h_{i,j}}}{2k}$$

with $k > 0$. The radicand is nonnegative, since $v_{i,j} > h_{i,j} > 0$ leads to

$$(v_{i,j} + k)^2 - 4 k h_{i,j} > h_{i,j}^2 + 2 h_{i,j} k + k^2 - 4 k h_{i,j} = (h_{i,j} - k)^2 \ge 0.$$

Taking the positive root first, $q^{(+)}_{i,j}$, we get

$$q^{(+)}_{i,j} = \frac{v_{i,j} + k + \sqrt{(v_{i,j} + k)^2 - 4 k h_{i,j}}}{2k} > 1,$$

which cannot be a probability. Therefore, we only deal with the negative root, $q^{(-)}_{i,j}$.
The boundary condition, $\sum_{j=1}^{L} q^{(-)}_{i,j} = 1$, leads to the root equation in $k$ given by

$$(L - 2)\, k + \sum_{j=1}^{L} \left( v_{i,j} - \sqrt{(v_{i,j} + k)^2 - 4 k h_{i,j}} \right) = 0,$$

with unique solution $k^* > 0$, by construction. In conclusion, the MLEs for the transition
probabilities are

$$q^*_{i,j} = \frac{v_{i,j} + k^* - \sqrt{(v_{i,j} + k^*)^2 - 4 k^* h_{i,j}}}{2k^*}.$$

3. In case $\sum_{j=1}^{L} \frac{h_{i,j}}{v_{i,j}} < 1$, with $k^* < 0$ in the root equation, the MLE is the same,

$$q^*_{i,j} = \frac{v_{i,j} + k^* - \sqrt{(v_{i,j} + k^*)^2 - 4 k^* h_{i,j}}}{2k^*},$$

but with $k^* < 0$. The intuition is that the empirical transition probabilities $\frac{h_{i,j}}{v_{i,j}}$,
obtained by only counting the hits, are inflated to maximize the likelihood of hits and
misses.
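
The three cases can be handled by a single numerical routine, sketched below under the assumption $0 < h_{i,j} < v_{i,j}$ for every state in the row. The sum of the negative roots is decreasing in $k$, so the sketch brackets $k^*$ by doubling and then bisects; bisection is a choice made here for simplicity, not a method prescribed by the text. The names `q_of_k` and `hit_miss_mle_row` are hypothetical. The negative root is evaluated in an algebraically equivalent form, $2h / (v + k + \sqrt{(v + k)^2 - 4kh})$, which stays well defined at $k = 0$.

```python
import math


def q_of_k(h, v, k):
    """Negative root of k q^2 - (v + k) q + h = 0, written to be stable near k = 0."""
    a = v + k
    root = math.sqrt(a * a - 4.0 * k * h)
    if a >= 0:
        # Equivalent to (a - root) / (2k); equals h / v at k = 0.
        return 2.0 * h / (a + root)
    # Here k < -v < 0, so dividing by 2k is safe.
    return (a - root) / (2.0 * k)


def hit_miss_mle_row(hits, views, tol=1e-12):
    """MLE of one transition-matrix row from hit and view counts.

    Assumes 0 < hits[j] < views[j] for every state j in the row.
    """
    def excess(k):
        # Sum of the row's probabilities minus one; decreasing in k.
        return sum(q_of_k(h, v, k) for h, v in zip(hits, views)) - 1.0

    if abs(excess(0.0)) < tol:
        # Case 1: the hit/view ratios already sum to one.
        return [h / v for h, v in zip(hits, views)]

    # Case 2 (ratios sum above one) has k* > 0; case 3 has k* < 0.
    sign = 1.0 if excess(0.0) > 0 else -1.0
    step = 1.0
    while sign * excess(sign * step) > 0:
        step *= 2.0
    low, high = (0.0, step) if sign > 0 else (-step, 0.0)

    for _ in range(200):
        middle = 0.5 * (low + high)
        if excess(middle) > 0:
            low = middle   # the root lies to the right
        else:
            high = middle  # the root lies to the left
    k_star = 0.5 * (low + high)
    return [q_of_k(h, v, k_star) for h, v in zip(hits, views)]
```

For the symmetric two-state row `hit_miss_mle_row([3, 3], [4, 4])`, the ratios sum to 1.5, the root is $k^* = 4$, and both probabilities come out as 0.5. For a row whose ratios sum to less than one, every returned probability is larger than its hit/view ratio, which is the inflation described in case 3.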
In conclusion, this MLE approach, while making use of all the observations, has two major
limitations. Computationally, it requires solving a root problem after each new observation,
and operationally, it presumes the searcher receives a target location update each time period
from the oracle. The former can be ameliorated by warm starting the root finding algorithm
with the last solution, and the latter leads to the next scenario.


Case 4: No Oracle Present 

At the other extreme of the gamut, the searcher may never receive feedback from an oracle,
making an MLE that includes both hits and misses substantially more difficult. The issue in
this setting is that, rather than having one optimization problem for each row of the transition
matrix thanks to the separability structure, there is a single optimization problem involving all
the transition probabilities. In this subsection we sketch the ideas for a solution approach.

The starting point is a sample path of hits and misses. Namely, let $\tau_1 = 1$, and let $\tau_2, \tau_3, \ldots, \tau_{\eta_n}$
be the (random) times where the target is located, where $1 \le \eta_n \le n$ is the (random)
number of times the target is found by time $n$. Then a sample path between the first
two hits is $X_1 = x_1, X_2 \ne x_2, \ldots, X_{\tau_2 - 1} \ne x_{\tau_2 - 1}, X_{\tau_2} = x_{\tau_2}$; between hits two and three it is
$X_{\tau_2 + 1} \ne x_{\tau_2 + 1}, X_{\tau_2 + 2} \ne x_{\tau_2 + 2}, \ldots, X_{\tau_3 - 1} \ne x_{\tau_3 - 1}, X_{\tau_3} = x_{\tau_3}$, and so on.

Inspired by Section 3.2, we maximize the likelihood by time $\tau_{\eta_n}$ (for simplicity)

$$L(q_{1,1}, \ldots, q_{1,L}; \ldots; q_{L,1}, \ldots, q_{L,L})$$
$$= \Pr(X_{\tau_{\eta_n}} = x_{\tau_{\eta_n}}, X_{\tau_{\eta_n} - 1} \ne x_{\tau_{\eta_n} - 1}, \ldots, X_{\tau_2 + 1} \ne x_{\tau_2 + 1}, X_{\tau_2} = x_{\tau_2}, X_{\tau_2 - 1} \ne x_{\tau_2 - 1}, \ldots, X_2 \ne x_2, X_1 = x_1)$$
$$= Q_{x_1, \bar{x}_2} \cdot Q_{\bar{x}_2, \bar{x}_3} \cdots Q_{\bar{x}_{\tau_2 - 1}, x_{\tau_2}} \cdot Q_{x_{\tau_2}, \bar{x}_{\tau_2 + 1}} \cdots Q_{\bar{x}_{\tau_{\eta_n} - 1}, x_{\tau_{\eta_n}}}$$

for all states $x \in \mathcal{L}$, where $Q_{x_1, \bar{x}_2}$ is row $x_1$ of the estimated
transition matrix $Q$ without the element $Q_{x_1, x_2}$, $Q_{\bar{x}_2, \bar{x}_3}$ is the sub-matrix formed by removing row $x_2$ and column $x_3$, and
$Q_{\bar{x}_{\tau_{\eta_n} - 1}, x_{\tau_{\eta_n}}}$ is the $x_{\tau_{\eta_n}}$’th column of $Q$ without the $x_{\tau_{\eta_n} - 1}$ entry. The optimization problem
leading to the ML estimator is

$$\max_{q_{i,j},\ (i,j) \in \mathcal{L}^2} \ \log L(q_{1,1}, \ldots, q_{1,L}; \ldots; q_{L,1}, \ldots, q_{L,L}),$$
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for each state pair (i,j) with Vjj > 0, with the following constraints: 

L 

Yj %j ~ l ' V/ 6 L 

7=1 

%j> 0, V/ e L 

The constraint set, being the intersection of L simplexes, is convex. The Hessian of the 
objective function can be shown to be negative definite, so that the objective function is 
concave. Hence, as in Case 3 above, the KKT conditions are sufficient for optimality. 
Unfortunately, the resulting root equations cannot be solved analytically in this case. We
believe that an online algorithm, such as online gradient descent, would be useful in this 
setting, but leave as future work the problem of developing a more insightful approach. 
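To make the product of sub-matrices in the likelihood concrete, the following sketch evaluates the likelihood of a hit/miss path for a given candidate matrix. It is an illustration under stated assumptions, not a solution method: states are 0-indexed, `looks[t]` is the state the sensor examined at step `t`, and the function name `hit_miss_likelihood` is hypothetical. Zeroing an entry of the running vector after a miss plays the role of removing that row or column from $Q$.

```python
import numpy as np


def hit_miss_likelihood(Q, start, looks, hits):
    """Likelihood of a hit/miss path under a candidate transition matrix Q.

    start : state of the first hit (the target is known to be there)
    looks : states examined by the sensor at each later step
    hits  : booleans, True if the target was found at that step
    """
    Q = np.asarray(Q, dtype=float)
    alpha = np.zeros(Q.shape[0])
    alpha[start] = 1.0
    for x, hit in zip(looks, hits):
        alpha = alpha @ Q              # one transition of the target
        if hit:
            found = alpha[x]           # target is in x: drop all other states
            alpha = np.zeros_like(alpha)
            alpha[x] = found
        else:
            alpha[x] = 0.0             # target is not in x: drop that state
    return float(alpha.sum())
```

For example, with a two-state matrix `[[0.75, 0.25], [0.80, 0.20]]`, a start in state 0, a miss in state 1, and then a hit in state 0, the function returns $0.75 \times 0.75 = 0.5625$. Maximizing the logarithm of this quantity over all rows of `Q` jointly is the single optimization problem described above.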

3.4 Exploration and Exploitation 

We answer two primary questions in this section. First, why not use the current estimate 
of the highest transition probability, i.e., the mode of a given row of $Q^{(n)}$, to determine
the sensor allocation? Second, how can the searcher incorporate uncertainty in a way that 
induces efficient exploration of all the states? We tackle these questions in the context of 
Cases 2 and 3 of the preceding section, for a searcher that has only one sensor. 

The main purpose of the resulting search rule is to ensure that the algorithm finds or learns
the optimal arm, i.e., the true mode of a given row of the transition matrix $P$, instead of
narrowing in on a sub-optimal arm. A quick example demonstrates the need for this index
policy. Consider the following true transition probability vector for state 1,

$$
p_{1,:} = \left[ \tfrac{1}{10} \ \ \tfrac{2}{10} \ \ \tfrac{3}{10} \ \ \tfrac{4}{10} \right].
$$

Now, imagine that after a number of periods, say $n$, the estimate of $p_{1,:}$ is

$$
\left[ \tfrac{1}{5} \ \ \tfrac{1}{4} \ \ \tfrac{6}{20} \ \ \tfrac{1}{4} \right],
$$

where $\tfrac{6}{20}$ should be interpreted as 6 hits in state 3 out of 20 views in state 3
when the target was just seen (or reported by the oracle) in state 1.
If the searcher always places the sensor at the mode of this row, which according to the
current estimate is the best option, the searcher will continue to sample from that state,
increasing the precision of this specific transition probability estimate. As $n$ gets larger,
the estimate remains close to 0.3 with high probability, since that is in fact its true value,
and the searcher will therefore never explore the remaining arms of this row since they
appear to be worse. But we know that this is a sub-optimal arm, since in fact $p_{1,4} = 0.4$.
This small example highlights the need for some form of exploration term to force the
algorithm to continue exploring arms that appear to be sub-optimal at the time but that may
in fact be optimal, as in the case above.


The first step to derive an index that forces exploration is to bound the estimator error
probability, $\Pr(|q_{i,j}^{(n)} - p_{i,j}| > \epsilon)$, for $\epsilon > 0$. We now argue how
this can be done for the estimator $q_{i,j}^{(n)}$ in Cases 2 and 3 of Section 3.3. To keep
the notation simple we omit the superscript $(n)$.

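For $q_{i,j}$ as in Case 3 of Section 3.3, and $k^* > 0$,

$$
\begin{aligned}
q_{i,j} - p_{i,j} > \epsilon
&\iff \frac{v_{i,j} + k^* - \sqrt{(v_{i,j} + k^*)^2 - 4 k^* h_{i,j}}}{2 k^*} > p_{i,j} + \epsilon \\
&\iff v_{i,j} + k^* - \sqrt{(v_{i,j} + k^*)^2 - 4 k^* h_{i,j}} > 2 k^* (p_{i,j} + \epsilon) \\
&\iff \left( v_{i,j} + k^* - 2 k^* (p_{i,j} + \epsilon) \right)^2 > (v_{i,j} + k^*)^2 - 4 k^* h_{i,j} \\
&\iff -4 k^* (p_{i,j} + \epsilon)(v_{i,j} + k^*) + \left( 2 k^* (p_{i,j} + \epsilon) \right)^2 > -4 k^* h_{i,j} \\
&\iff h_{i,j} - (p_{i,j} + \epsilon)(v_{i,j} + k^*) + k^* (p_{i,j} + \epsilon)^2 > 0 \\
&\iff k^* < \frac{h_{i,j} - v_{i,j}(p_{i,j} + \epsilon)}{(p_{i,j} + \epsilon) - (p_{i,j} + \epsilon)^2}.
\end{aligned}
\tag{3.3}
$$

Hence,

$$
\Pr(q_{i,j} - p_{i,j} > \epsilon)
= \Pr\left( k^* < \frac{h_{i,j} - v_{i,j}(p_{i,j} + \epsilon)}{(p_{i,j} + \epsilon) - (p_{i,j} + \epsilon)^2} \right)
\le \Pr\left( h_{i,j} - v_{i,j}(p_{i,j} + \epsilon) > 0 \right)
\le \exp(-2 v_{i,j} \epsilon^2),
\tag{3.4}
$$

the last inequality by Hoeffding's inequality.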
The same proof technique applies for $k^* < 0$, leading to

$$
\Pr(q_{i,j} - p_{i,j} < -\epsilon) \le \exp(-2 v_{i,j} \epsilon^2).
$$

Regarding the estimator of Case 2 without rescaling (cf. Equation 3.1) in Section 3.3,
the same approach gives

$$
\Pr(q_{i,j} - p_{i,j} > \epsilon)
= \Pr\left( \frac{h_{i,j}}{v_{i,j}} - p_{i,j} > \epsilon \right)
= \Pr\left( h_{i,j} - v_{i,j}(p_{i,j} + \epsilon) > 0 \right)
\le \exp(-2 v_{i,j} \epsilon^2),
$$

and similarly in the case of $k^* < 0$, so we end up with the same bound as Case 3.
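The bound $\exp(-2 v_{i,j} \epsilon^2)$ for the raw ratio can be checked numerically. The sketch below assumes that the $v$ views of a state are independent Bernoulli trials with success probability $p$, so that the hit count is binomial; the function names are hypothetical, and the check is an illustration, not part of the proof.

```python
from math import comb, exp


def exact_upper_tail(v, p, eps):
    """Pr(h / v - p > eps) for h ~ Binomial(v, p), by direct summation."""
    threshold = v * (p + eps)
    return sum(
        comb(v, h) * p**h * (1.0 - p) ** (v - h)
        for h in range(v + 1)
        if h - threshold > 1e-12   # strict inequality, with a float guard
    )


def hoeffding_bound(v, eps):
    """Upper bound exp(-2 v eps^2) on the same tail probability."""
    return exp(-2.0 * v * eps**2)


if __name__ == "__main__":
    for v in (5, 20, 100, 500):
        print(v, exact_upper_tail(v, 0.3, 0.1), hoeffding_bound(v, 0.1))
```

In this setting the exact tail probability lies below the bound for every sample size, and both shrink as the number of views grows.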

From Equations 3.3 and 3.4 we see that the MLE approach is less conservative than the 
estimator obtained by only considering hits, which is further supported by the numerical 
work in Chapter 4. A more refined proof technique is needed to flesh out the gain derived 
by including the miss observations, but this is left as future work. 

From here the classical MAB index follows, 



for each $i \in \mathcal{X}$. A constant larger than 1.5 leads to more exploration than needed, while the
opposite is true for a constant smaller than 1.5; see [20] for a derivation. The intuition behind 
Equation 3.5 is that each sub-optimal state is sampled logarithmically with the number of 
views from the source state, while the best state (i.e., the mode) is sampled the rest of the 
time. 


Given that the target was just observed in state i, the MAB algorithm proceeds by placing 
the sensor in the state with the largest index, 

$$
\arg\max_{j \in \{1, \ldots, L\}} \ \left[ \, q_{i,j} + \text{(exploration term)} \, \right].
$$

It is well-known (see Theorem 1 in [20]) that this approach produces an expected regret 
with an upper bound that is logarithmic with the number of views. 
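The allocation rule can be sketched in code. The exact exploration term is not reproduced here, so the sketch assumes a standard upper-confidence form, $\sqrt{1.5 \ln(n_i) / v_{i,j}}$, where $n_i$ is the total number of views taken from source state $i$; this form, the function names, and the view counts other than the 6 hits in 20 views are assumptions made for illustration only.

```python
import math


def greedy_choice(hits, views):
    """Index of the largest raw estimate h / v (pure exploitation)."""
    est = [h / v for h, v in zip(hits, views)]
    return max(range(len(est)), key=est.__getitem__)


def ucb_choice(hits, views, c=1.5):
    """Index of the largest estimate plus exploration bonus.

    Assumed bonus: sqrt(c * ln(n_i) / v_ij), with n_i the total number of
    views from the source state. Every entry of views must be positive.
    """
    n_i = sum(views)
    index = [
        h / v + math.sqrt(c * math.log(n_i) / v)
        for h, v in zip(hits, views)
    ]
    return max(range(len(index)), key=index.__getitem__)


if __name__ == "__main__":
    hits = [1, 1, 6, 1]     # hypothetical counts matching the estimates
    views = [5, 4, 20, 4]   # 1/5, 1/4, 6/20, 1/4 from the example above
    print(greedy_choice(hits, views), ucb_choice(hits, views))
```

With these counts the greedy rule keeps selecting state 3 (index 2), while the bonus steers the sensor to one of the rarely viewed states, which is the behaviour the exploration term is meant to induce.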
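To complete the picture, for the estimator of Case 2 with rescaling (cf. Equation 3.2) in
Section 3.3 we get

$$
\Pr(q_{i,j} - p_{i,j} > \epsilon)
= \Pr\left( \left( 1 + \sum_{l \neq j} \frac{h_{i,l} / v_{i,l}}{h_{i,j} / v_{i,j}} \right)^{-1} - p_{i,j} > \epsilon \right)
= \Pr\left( \sum_{l \neq j} \frac{h_{i,l} / v_{i,l}}{h_{i,j} / v_{i,j}} < \frac{1 - (p_{i,j} + \epsilon)}{p_{i,j} + \epsilon} \right),
$$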


after working inside the parentheses. Using standard arguments, the right-hand side above 
becomes 


$$
\le \Pr(\cdots) \le \Pr(\cdots) \le \exp\left( -2 v_{i,j} \epsilon^2 \cdots \right).
$$




This bound can be used to produce an index similar to Equation 3.5, but with higher regret 
due to the larger upper bound on the probability of error. However, the numerical results in 
the next chapter suggest that the rescaled MLE has smaller regret (i.e., better performance) 
than the unscaled MLE, suggesting that the inequalities above are too loose.

As a final note, while in this section we only considered a searcher with a single sensor, 
the developments can be extended to a multiple-sensor setting following the ideas of Chen, 
Wang, and Yuan in their “Combinatorial Multi-Armed Bandit: General Framework, Results 
and Applications” [21], who study the problem of how to optimally pull several arms 
simultaneously. 
CHAPTER 4:

Analysis and Simulation Results 


In this chapter, we examine the initial simulation results, validating the analytic bounds
calculated in Chapter 3. Specifically, in two small examples we see eight to ten times
faster convergence rates when compared to the naive approach. We examine this comparison
between the Naive method and our Single-Miss Algorithm throughout this chapter, which is
organized as follows:

1. Tiny System (4 States) 

2. Small System (10 States) 

To compare our algorithm's performance against the Naive method, we use MATLAB to
simulate the target's behaviour, based on each specific setting's transition probabilities,
and then implement both methods to attempt to learn the target's behaviour pattern, i.e., its
transition probabilities. Specifically, we capture, at each time step $n$, the estimated or
empirical probability of the target transitioning from state 1 to state 4, or in notation,
$q_{1,4}^{(n)}$.

Further, we break the Naive method into two separate versions, shown in the plots as
Naive with normalization and Naive without normalization. The first, as its name implies,
normalizes each row of the transition probability matrix at each time step, while the second
does not (this breaks the law of total probability, but since the algorithm only needs the row
of index values to determine the next sensor allocation, the algorithm still functions). The
non-normalized version instead plots the raw ratio of hits to attempted observations; in
notation, it defines the estimated probability as
$q_{i,j}^{(n)} = h_{i,j}^{(n)} / v_{i,j}^{(n)}$. The reason for this becomes apparent in
Section 4.1. Namely, the normalized version consistently over-estimates the target's
probabilities, since the normalization process we use treats each estimated ratio as
equally precise.

In both Sections 4.1 and 4.2, MATLAB was used to simulate the behaviour of the target
using the standard built-in random number generator, with each method replicated
100 times. The mean of these 100 simulations was then used to generate the
95% Confidence Interval bounds denoted by the thin colored lines in the following figures.
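The averaging step can be sketched as follows. The simulations in this chapter were run in MATLAB; this Python version is only an illustration, and it assumes a normal-approximation interval (mean plus or minus 1.96 standard errors), which may differ from the interval construction actually used. The function name `mean_with_ci` is hypothetical.

```python
import numpy as np


def mean_with_ci(runs, z=1.96):
    """Per-time-step mean and normal-approximation 95% interval.

    runs : array of shape (replications, time_steps), one row per replication
    """
    runs = np.asarray(runs, dtype=float)
    mean = runs.mean(axis=0)
    # Standard error of the mean across replications (sample std, ddof=1).
    half_width = z * runs.std(axis=0, ddof=1) / np.sqrt(runs.shape[0])
    return mean, mean - half_width, mean + half_width
```

Each row of `runs` would hold one replication's sequence of estimates $q_{1,4}^{(n)}$, so the three returned arrays correspond to the thick mean line and the two thin bounding lines in the figures.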






4.1 Tiny System (Four States) 

In this section we examine simulated convergence rates of both the basic Naive and our
proposed Single-Miss MLE methods on a very small and dense example. Specifically, we
examine a four-state system corresponding to a potential terrorist's behaviour within a city.
The four states are his house, his workplace, a store, and a cafe. His transition kernel, or
transition probability matrix, is defined as:



             house (1)   work (2)   store (3)   cafe (4)
house (1)      0.10        0.20       0.30        0.40
work (2)       0.90        0.02       0.04        0.04
store (3)      0.70        0.03       0.15        0.12
cafe (4)       0.82        0.06       0.02        0.10
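A target following the four-state matrix above can be simulated with a few lines of code. The thesis simulations were written in MATLAB; the Python sketch below is only an illustration, with 0-indexed states and a hypothetical function name `simulate`.

```python
import numpy as np

STATES = ["house", "work", "store", "cafe"]
P = np.array([
    [0.10, 0.20, 0.30, 0.40],   # from house
    [0.90, 0.02, 0.04, 0.04],   # from work
    [0.70, 0.03, 0.15, 0.12],   # from store
    [0.82, 0.06, 0.02, 0.10],   # from cafe
])


def simulate(P, start, n_steps, seed=0):
    """Sample a path of the target: the next state is drawn from the
    row of P that belongs to the current state."""
    rng = np.random.default_rng(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(int(rng.choice(len(P), p=P[path[-1]])))
    return path


if __name__ == "__main__":
    print([STATES[s] for s in simulate(P, start=0, n_steps=10)])
```

Every row sums to one, and the mode of the first row is the cafe, which is the transition tracked in the figures of this section.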


Since we are looking to learn the behaviour pattern of our target, we measure our
performance on the mode of a row. In this case, we compare how quickly the methods can
estimate the house to cafe transition probability, in notation $p_{1,4} = 0.4$. Of
note, the horizontal axis in Figures 4.1 through 4.4 is discrete time. In Figures 4.1 and 4.3,
the vertical axis is probability, with the thick black line marking the true probability, 0.4.


[Figure: Averaged estimated $p_{1,4}$ vs. time ($N$ = 1k steps with 100 replications on 4-state system)]

Figure 4.1. Four State Estimated Transition Probabilities vs. Time (mean
with 95% confidence intervals). Generated in MATLAB.
As can be seen in Figure 4.1, our Single-Miss MLE method's 95% Confidence Interval
captures the true probability well before time step $n = 50$, the Naive without normalization
method captures it around time step $n = 500$, and the Naive with normalization method
does not capture it until well after time step $n = 1000$. Comparing the first two figures
(50 versus 500 steps) shows that our Single-Miss MLE method learns at least ten times faster
than the better of the naive versions. Further, we can also see that the normalized Naive
version does in fact over-estimate our parameter of interest. This finding is examined
further in Section 4.2.


Additionally, we can see from Figure 4.2 that we also attain logarithmic learning rates
with both MLE methods. At the same time though, the Single-Miss MLE method clearly
outperforms both Naive versions in both the expected cumulative regret and the 95%
Confidence Bounds on that mean. For this analysis, we define the expected regret (vertical
axis in Figures 4.2 and 4.4) as the absolute difference between the true probability and the
current estimate $q_{i,j}^{(n)}$, with these differences being summed over the time steps.
In notation:

$$
\sum_{n=1}^{N} \left| q_{i,j}^{(n)} - p_{i,j} \right|,
\quad \text{which for state 1 to 4 is:} \quad
\sum_{n=1}^{N} \left| q_{1,4}^{(n)} - 0.4 \right|.
\tag{4.1}
$$
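Equation 4.1 is a running sum of absolute errors, which can be computed in one line. The sketch below is illustrative; the function name `cumulative_regret` and the sample estimates are hypothetical.

```python
import numpy as np


def cumulative_regret(estimates, p_true):
    """Running sum of |estimate - truth| over time steps (Equation 4.1)."""
    estimates = np.asarray(estimates, dtype=float)
    return np.cumsum(np.abs(estimates - p_true))


if __name__ == "__main__":
    # Hypothetical sequence of estimates of p_{1,4} = 0.4.
    print(cumulative_regret([0.0, 0.5, 0.4], 0.4))
```

Once the estimate settles on the true value, each new term is close to zero and the curve flattens, which is why a slowly growing curve in Figures 4.2 and 4.4 indicates fast learning.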


[Figure: Average cumulative regret (|true - est|) for $p_{1,4}$ vs. time ($N$ = 1k steps with 100 replications on 4-state system)]

Figure 4.2. Four State Expected Regret vs. Time (mean with 95% confidence
intervals). Generated in MATLAB.
4.2 Small System (Ten States)

In this section, we expand the state space from four to ten states. This might correspond to 
adding another location, maybe a bar, and splitting each of these five states into a day and 
night version. 

Below is a randomly generated transition kernel, or transition probability matrix, for the
target's behaviour modeled with ten states. The results of applying the algorithms to this
behaviour pattern are shown in Figures 4.3 and 4.4. Of note in the behaviour matrix below,
the $p_{1,4}$ probability is intentionally set to 0.4 for ease of comparison to the four-state
system analyzed in Section 4.1.



        1      2      3      4      5      6      7      8      9      10
 1      0      0.16   0      0.40   0.13   0.10   0      0.10   0      0.10
 2      0      0.23   0.16   0      0      0.18   0      0.20   0      0.22
 3      0      0      0.23   0.22   0      0      0.28   0.26   0      0
 4      0      0      0      0      0      0      0.53   0      0      0.46
 5      0.30   0.11   0      0.11   0.13   0.12   0      0      0.13   0.09
 6      0.40   0      0      0      0      0.16   0.16   0      0.10   0.17
 7      0.32   0      0.11   0.12   0.16   0      0.11   0      0.17   0
 8      0.64   0      0      0.12   0      0.11   0      0      0.12   0
 9      0.45   0.13   0      0      0.11   0      0      0.10   0.09   0.11
10      0.61   0      0.12   0      0      0.13   0      0      0.13   0


Since we are looking to learn the behaviour pattern of our target, we again measure our
performance on the mode of the first row. In this case, we compare how quickly the methods
can estimate the state 1 to state 4 transition probability, or in notation $p_{1,4} = 0.4$.
The figures in this section follow the same format as in the previous section, namely the
horizontal axis is discrete time. The vertical axis in Figure 4.3 is probability and the
vertical axis in Figure 4.4 is the cumulative regret as defined in Equation 4.1.
[Figure: Averaged estimated $p_{1,4}$ vs. time ($N$ = 2k steps with 100 replications on 10-state system)]

Figure 4.3. Ten State Estimated Transition Probabilities vs. Time (mean
with 95% confidence intervals). Generated in MATLAB.

Again, we see in Figure 4.3 that the estimation power of the Single-Miss method continues
to outperform both Naive versions. In this setting though, the effect of learning from miss
data is less pronounced. Knowing that the target did not transition to a given state, a miss,
provides less information than a miss does in the four-state system, since the remaining
probability mass is distributed across nine states in this case versus just the three from
Section 4.1.

Figure 4.3 highlights even more strongly the fact that the normalized Naive version remains
over-inflated for a very long period of time. This is due to the normalization process failing
to account for the different precision of each ratio. Since the "sub-optimal" arms are looked
at far less frequently (think far smaller sample size), their ratios of hits to attempted views
are by nature less precise (have higher variance) than those of the more "optimal" arms. The
intuition here is that as a sample size increases, we decrease the variance of our estimated
parameter, a specific transition probability in this case. Therefore, if we normalize without
some form of weighting scheme tied to the different arms' precisions, we end up over-inflating
the mode, as can be clearly seen in Figure 4.3.
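A small numerical illustration shows one way the inflation can arise. The raw ratios below are hypothetical: the rarely viewed arms have imprecise ratios that happen to fall below their true values, so the row sums to less than one and normalization scales every entry, including the mode, upward.

```python
def normalize(row):
    """Rescale a row of raw ratios so that it sums to one."""
    total = sum(row)
    return [r / total for r in row]


# Hypothetical hits / views for one row; the last arm is the well-sampled mode.
raw = [1 / 10, 0 / 3, 1 / 4, 40 / 100]
scaled = normalize(raw)

if __name__ == "__main__":
    print(raw[3], scaled[3])   # mode before and after normalization
```

Here the raw estimate of the mode equals 0.4, while the normalized estimate is pushed above 0.5, even though the mode itself was estimated from far more views than the other arms.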
[Figure: Average cumulative regret (|true - est|) for $p_{1,4}$ vs. time ($N$ = 2k steps with 100 replications on 10-state system)]

Figure 4.4. Ten State Expected Regret vs. Time (mean with 95% confidence
intervals). Generated in MATLAB.

Figure 4.4 shows that the logarithmic learning rates, or cumulative expected regrets, are still
achieved as the state space increases. We also see the very pronounced effect of
over-inflation on the cumulative regret for the normalized Naive method. Lastly, we again
see that the Single-Miss MLE method continues to outperform both Naive versions.
CHAPTER 5:

Conclusion and Recommendations


This chapter summarizes the conclusions of this research and explores some areas of
recommended future work.


5.1 Conclusions 

This thesis presents a novel method for applying Machine Learning techniques to learning
behaviour patterns modeled as discrete time Markov Chains. Further, the Single-Miss MLE
algorithm obtains a logarithmic learning rate and performs significantly better in
expectation than the Naive methods described in Chapters 3 and 4 in the scenarios
presented in Chapter 4. At the same time though, given an appropriate exploration inflation
term, developed in Chapter 3, the Single-Miss MLE and Naive MLE methods all obtain the
desired logarithmic rate.

Chapter 1 laid out the background to this thesis and examined different areas of research
that intersect in an interesting way, specifically current Stochastic Multi-Armed Bandit
theory and the modeling of a target's behaviour as a discrete time Markov Chain. The
intersection of these two fields of research in our problem, and our resulting Single-Miss
MLE algorithm, are novel as far as the writers are aware.

5.2 Future Work 

This section explores some proposed future work or extensions. These ideas flow naturally
from the assumptions and limitations mentioned in Chapters 1 and 3.

Leveraging Multi-Step Misses 

Leveraging the complementary miss data, namely using the Single-Miss approach instead of
the Naive approach, resulted in a significant gain in convergence rate. Therefore, extracting
additional data from the multi-step miss sequences or strings should likewise yield
additional performance improvements.
Impact of Density and Increased State Space Size

As Chapter 4 highlights, the increased state space size resulted in a decreased effect or 
benefit from gleaning information from miss data. This indicates that further analysis 
should be conducted exploring the impact or effect of the state space’s size and resultant 
transition probability matrix density on the learning rate. 

Leveraging a Bayesian Prior 

Depending on how one defines the state space, some transitions may be completely
impossible while others are just highly unlikely. An example of the former is a potential
terrorist transitioning from a cafe at 8 am to the airport at midnight in a single one-hour
time step, which is impossible. Others may be highly unlikely based on current technology
or some prior analysis of the target's behaviour by the searcher. These considerations
motivate a Bayesian approach to this problem, which would enable this prior knowledge to
be captured.


Noisy Sensor Responses 

What if the sensor returns a noisy response, think false negatives or positives? This might 
correspond to a patrol officer walking by the outside of a cafe during the morning rush. The 
patrol officer glances through the window but is unable to see everyone in the cafe. The 
officer then must give some sort of estimate of their certainty of the target’s absence from 
the cafe. The converse could also occur where a sensor thinks it observed the target but in 
fact did not. Implementing noisy responses from the sensor would decrease the abstraction 
of this problem and hence, increase the applicability of the resultant algorithm. 

Multiple Sensors 

Given a single target, what if the Law Enforcement Agency, the searcher, had multiple 
sensors to deploy or allocate? This concept could provide an interesting extension to the 
current Single-Miss MLE method where the number of sensors is still less than the total 
number of observable states. 
Multiple Related Targets

What if instead of multiple sensors, there were multiple targets that were somehow either 
spatially correlated or dependent? This would enable the algorithm to leverage data from 
multiple targets, potentially increasing the learning rate on both targets. 

Time-Horizon Weighting 

One key assumption of this thesis is that the target's behaviour pattern remains stationary. In
other words, the target's behaviour does not change with time. What if we relax this
assumption and allow the target's behaviour to vary with time? This motivates the
development of a time-horizon weighting scheme that values more recent observations over
older data. This could also then be extended to form the basis for a change point detection
algorithm.
APPENDIX A:

Basic Mathematical Notation


While not absolutely necessary, this appendix provides a quick-reference guide to the
notation we use in Chapters 3 and 4.


Sets

$l, i, j \in \mathcal{X} = \{1, 2, \ldots, L\}$. Let $L = |\mathcal{X}|$ (total number of different states). And, let $l^-$ be the
state space complement of $l$, or all other states in $\mathcal{X}$.

$n \in \mathbb{N} = \{1, 2, \ldots\}$, being the current discrete time step. We also use $k$ as a past time step.

Data 

$T_n$, the target's current location or state at time step $n$ ($n$, the current time step).

$P$, true, underlying transition probability matrix governing the target's transition behavior.

$p_{i,j}$, target's probability of transitioning from state $i$ to $j$, corresponding to the
$(i,j)$'th element within the matrix $P$.


Variables

$S_n$, sensor's current location or where it was sent at time $n$ ($n$, the current time step).

$v_{i,j}^{(n)}$, cumulative, by $n$, number of sensor allocations to $j$, given the target came from $i$.

$h_{i,j}^{(n)}$, cumulative, by $n$, number of target observations in $j$, given the target came from $i$.

$q_{i,j}^{(n)}$, target's estimated, as of $n$, probability of transitioning from state $i$ to $j$,
corresponding to the $(i,j)$'th element within the matrix $Q^{(n)}$ defined next.

$Q^{(n)}$, current, by $n$, empirical transition probability matrix estimating the true behavior.

$R_{i,j}^{(n)}$, cumulative regret, by $n$, between the estimated and true transition probabilities
for the transition $i,j$. Specifically, we define this regret as:

$$
R_{i,j}^{(n)} = \sum_{k=1}^{n} \left| q_{i,j}^{(k)} - p_{i,j} \right|
$$
APPENDIX B:

Simple Markov Chain Example


While this appendix is also not strictly necessary, it is included to serve as a very quick
introduction to discrete time Markov chains for readers who have no prior exposure to
these concepts. Towards this goal, we define a Markov Chain for the non-technical reader
and provide a simple example to facilitate understanding.

Layman's Definition: A Markov Chain is a set of states, or state space, with associated
transition probabilities that models an object's stochastic or uncertain behaviour over time.
It assumes that future transitions depend only on the current state of the object and are
therefore independent of the past. We call this independence the memoryless property
of a Markov Chain, or the Markov Property. While this may seem like a rather large
assumption to make, if needed, we can embed information about the past into the current
state, thereby maintaining this assumption without losing the mathematical power of the
memoryless property of Markov Chains. The following is a more succinct definition using
mathematical notation:

Mathematical Definition: A Discrete Time Markov Chain is a stochastic (think uncertain)
process $\{X_t : t = 1, 2, \ldots\}$ taking values in a discrete state space $S = \{1, 2, \ldots, s\}$ that
satisfies the Markov Property. Meaning, for $A \subset S$:

$$
P(X_{t+1} \in A \mid X_1, \ldots, X_t) = P(X_{t+1} \in A \mid X_t)
$$


Example: A Washing Machine 

The simplest example is a washing machine. It is either working or broken, up or down,
and these states define its state space, the set {up, down}. We model the washing machine
as a stochastic process $\{X_t, t = 0, 1, \ldots\}$ with $X_t$ taking on the value of either up
or down in each time period $t$. Generically, if $X_t = x$, then the process or machine is
said to be in state $x$ at time $t$. Further, we define the finite state space as
$S = \{1, \ldots, s\}$, which corresponds to {up, down} in our example. Next, we define the
fixed probability of the machine transitioning from its current state in this time period to
another state (or back to the same state) via $p_{i,j}$. In mathematical and probability
notation: $P(X_{t+1} = j \mid X_t = i) = p_{i,j}$. Therefore, if the machine is currently up,
then we indicate the probability of it going down in the next time period as
$p_{\text{up},\text{down}}$. We also denote the probability of the machine staying up as
$p_{\text{up},\text{up}}$. Further, by the Law of Total Probability,
$p_{\text{up},\text{up}} + p_{\text{up},\text{down}} = 1$, or in layman's terms, the machine
must be either up or down in the next time period. These requirements can be expressed in
the following two properties:

1. For all $i, j$: $0 \le p_{i,j} \le 1$ (Probabilities must be between zero and one)

2. For all $i$: $\sum_{j \in S} p_{i,j} = 1$ (Law of Total Probability)

Further, we can mathematically express the entire transition dynamics of our washing
machine process as a matrix of probabilities. We use the same notation, with $p_{i,j}$
corresponding to the $i$'th, $j$'th element of our 2-dimensional transition probability
matrix, $P$. In our example it could look like the following:


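$$
P =
\begin{bmatrix}
p_{\text{up},\text{up}} & p_{\text{up},\text{down}} \\
p_{\text{down},\text{up}} & p_{\text{down},\text{down}}
\end{bmatrix}
=
\begin{bmatrix}
.75 & .25 \\
.80 & .20
\end{bmatrix}
$$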


The above matrix captures the behavior of our simple washing machine example. But as
you are no doubt thinking, not all breakdowns are equal. Some might be due to minor damage
while others might be of a much more extensive nature. This can also be captured by a
Markov Chain, as can be seen in the modified matrix below. This method of expanding the
state space to capture more detail is very useful and will enable us to apply the algorithms
developed within this thesis to a much broader set of problems than would seem possible
at first glance. As can be seen, we now have two zero probabilities, which correspond to
the machine being down due to minor damage and then somehow becoming down due to
major damage, and vice versa (whether these are zero or not ultimately depends on the
system being modeled).



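$$
P_{\text{mod}} =
\begin{bmatrix}
p_{\text{up},\text{up}} & p_{\text{up},\text{down}_{\text{min}}} & p_{\text{up},\text{down}_{\text{maj}}} \\
p_{\text{down}_{\text{min}},\text{up}} & p_{\text{down}_{\text{min}},\text{down}_{\text{min}}} & p_{\text{down}_{\text{min}},\text{down}_{\text{maj}}} \\
p_{\text{down}_{\text{maj}},\text{up}} & p_{\text{down}_{\text{maj}},\text{down}_{\text{min}}} & p_{\text{down}_{\text{maj}},\text{down}_{\text{maj}}}
\end{bmatrix}
=
\begin{bmatrix}
.75 & .20 & .05 \\
.80 & .20 & 0.0 \\
.40 & 0.0 & .60
\end{bmatrix}
$$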
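Both washing machine matrices can be checked and used in a few lines of code. The sketch below is illustrative only: it verifies the two properties listed earlier and computes two-step transition probabilities by multiplying the matrix with itself; the function names are hypothetical.

```python
import numpy as np

# Two-state machine: row/column order is (up, down).
P = np.array([[0.75, 0.25],
              [0.80, 0.20]])

# Three-state machine: order is (up, down_min, down_maj).
P_mod = np.array([[0.75, 0.20, 0.05],
                  [0.80, 0.20, 0.00],
                  [0.40, 0.00, 0.60]])


def is_transition_matrix(M):
    """Check both properties: entries in [0, 1] and rows summing to one."""
    M = np.asarray(M, dtype=float)
    return bool(np.all(M >= 0) and np.all(M <= 1)
                and np.allclose(M.sum(axis=1), 1.0))


def n_step(M, n):
    """n-step transition probabilities, i.e. the n-th matrix power."""
    return np.linalg.matrix_power(np.asarray(M, dtype=float), n)


if __name__ == "__main__":
    print(is_transition_matrix(P), is_transition_matrix(P_mod))
    print(n_step(P, 2))   # probability of each state two periods ahead
```

For the two-state machine, a machine that is up now is up again two periods later with probability $0.75 \times 0.75 + 0.25 \times 0.80 = 0.7625$.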